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Abstract 

We present a novel graphical framework for modeling non-negative sequential data with hierarchical 
structure. Our model corresponds to a network of coupled non-negative matrix factorization (NMF) 
modules, which we refer to as a positive factor network (PFN). The data model is linear, subject to 
non-negativity constraints, so that observation data consisting of an additive combination of individually 
representable observations is also representable by the network. This is a desirable property for modeling 
problems in computational auditory scene analysis, since distinct sound sources in the environment are 
often well-modeled as combining additively in the corresponding magnitude spectrogram. We propose 
inference and learning algorithms that leverage existing NMF algorithms and that are straightforward 
to implement. We present a target tracking example and provide results for synthetic observation data 
which serve to illustrate the interesting properties of PFNs and motivate their potential usefulness in 
applications such as music transcription, source separation, and speech recognition. We show how a 
target process characterized by a hierarchical state transition model can be represented as a PFN. Our 
results illustrate that a PFN which is defined in terms of a single target observation can then be used 
to effectively track the states of multiple simultaneous targets. Our results show that the quality of the 
inferred target states degrades gradually as the observation noise is increased. We also present results 
for an example in which meaningful hierarchical features are extracted from a spectrogram. Such a 
hierarchical representation could be useful for music transcription and source separation applications. 
We also propose a network for language modeling. 
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1 Introduction 



Wc present a graphical hidden variable framework for modeling non- negative sequential data with hierarchical 
structure. Our framework is intended for applications where the observed data is non-negative and is well- 
modeled as a non-negative linear combination of underlying non-negative components. Provided that we 
are able to adequately model these underlying components individually, the full model will then be capable 
of representing any observed additive mixture of the components due to the linearity property. This leads 
to an economical modeling representation, since a compact parameterization can explain any number of 
components that combine additively. Thus, in our approach, we do not need to be concerned with explicitly 
modeling the maximum number of observed components nor their relative weights in the mixture signal. 

To motivate the approach, consider the problem of computational auditory scene analysis (CAS A), which 
involves identifying auditory "objects" such as musical instrument sounds, human voice, various environmen- 
tal noises, etc, from an audio recording. Speech recognition and music transcription are specific examples 
of CAS A problems. When analyzing audio, it is common to first transform the audio signal into a time- 
frequency image, such as the spectrogram (i.e., magnitude of the short-time Fourier transform (STFT)). We 
empirically observe that the spectrogram of a mixture of auditory sources is often well-modeled as a linear 
combination of the spectrograms of the individual audio sources, due to the sparseness of the time-frequency 
representation for typical audio sources. For example, consider a recording of musical piece performed by 
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a band. We empirically observe that the spectrogram of the recording tends to be well-approximated as 
the sum of the spectrograms of the individual instrument notes played in isolation. If one could construct a 
model using our framework that is capable of representing any individual instrument note played in isolation, 
the model would then automatically be capable of representing observed data corresponding to arbitrary 
non-negative linear combinations of the individual notes. Likewise, if one could construct a model under 
our framework capable of representing a recording of a single human speaker (possibly including a language 
model), such a model would then be capable of representing an audio recording of multiple people speaking 
simultaneously Such a model would have obvious applications to speaker source separation and simultane- 
ous multiple-speaker speech recognition. We do not attempt to construct such complex models in this paper, 
however. Rather, our primary objective here will be to construct models that are simple enough to illustrate 
the interesting properties of our approach, yet complex enough to show that our approach is noise-robust, 
capable of learning from training data, and at least somewhat scalable. We hope that the results presented 
here will provide sufficient motivation for others to extend our ideas and begin experimenting with more 
sophisticated PFNs, perhaps applying them to the above-mentioned CAS A problems. 

An existing area of research that is related to our approach is non-negative matrix factorization (NMF) 
and its extensions. NMF is a data modeling and analysis tool for approximating a non-negative matrix X 
as the product of two non-negative matrices W and H so that the reconstruction error between X and WH 
is minimized under a suitable cost function. NMF was originally proposed by Paatero as positive matrix 
factorization [1]. Lee and Scung later developed robust and simple to implement multiplicative update rules 
for iteratively performing the factorization [2]. Various sparse versions of NMF have also been recently 
proposed [3] , [4] , [5] . NMF has recently been applied to many applications where a representation of non- 
negative data as an additive combination of non-negative basis vectors seems reasonable. Such applications 
include object modeling in computer vision, magnitude spectra modeling of audio signals [6], and various 
source separation applications [7]. The non-negative basis decomposition provided by NMF is, by itself, not 
capable of representing complex model structure. For this reason, extensions have been proposed to make 
NMF more expressive. Smaragdis extended NMF in [7] to model the temporal dependencies in successive 
spectrogram time slices. His NMF extension, which he termed Convolutive NMF, also appears to be a special 
case of one of our example models in Section 6.1. We are unaware of any existing work in the literature 
that allow for the general graphical representation of complex hidden variable models, particularly sequential 
data models, that is provided by our approach, however. 

A second existing area of research that is related to our approach is probabilistic graphical models [8] 
and in particular, dynamic Bayesian networks (DBNs) [9], [10], which are probabilistic graphical models for 
sequential data. We note that the Hidden Markov Model (HMM) is a special case of a DBN. DBNs are widely 
used for speech recognition and other sequential data modeling applications. Probabilistic graphical models 
are appealing because they they can represent complex model structure using an intuitive and modular 
graphical modeling representation. A drawback is that the corresponding exact and/or approximate inference 
and learning algorithms can be complex and difficult to implement, and overcoming tractability issues can 
be a challenge. 

Our objective in this paper is to present a framework for modeling non-negative data that retains the non- 
negative linear representation of NMF, while also supporting more structured hidden variable data models 
with a graphical means for representing variable intcrdependencies analogously to that of the probabilistic 
graphical models framework. We will be particularly interested in developing models for sequential data 
consisting of the spectrograms of audio recordings. Our framework is essentially a modular extension of 
NMF in which the full graphical model corresponds to several coupled NMF sub-models. The overall model 
then corresponds to a system of coupled vector or matrix factorization equations. Throughout this paper, 
we will refer to a particular system of factorizations and the corresponding graphical model as a positive 
factor network (PFN). We will refer to the dynamical extension of a PFN as a dynamic positive factor 
network (DPFN). Given an observed subset of the PFN model variables, we define inference as solving for 
the values of the hidden subset of variables and learning as solving for the model parameters in the system 
of factorization equations. 

Note that our definition of inference is distinct from the probabilistic notion of inference. In a PFN, 
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inference corresponds to solving for actual values of the hidden variables, whereas in a probabilistic model 
inference corresponds to solving for probability distributions over the hidden variables given the values of 
the observed variables. Performing inference in a PFN is therefore more analogous to computing the MAP 
estimates for the hidden variables in a probabilistic model. One could obtain an analogous probabilistic 
model from a PFN by considering the model variables to be non-negative continuous- valued random vectors 
and defining suitable conditional probability distributions that are consistent with the non-negative linear 
variable model. Let us call this class of models probabilistic PFNs. Exact inference is generally intractable 
in such a model since the hidden variables are continuous-valued and the model is not linear-Gaussian. 
However, one could consider deriving algorithms for performing approximate inference and developing a 
corresponding EM-based learning algorithm. We are unaware of any existing algorithms for performing 
tractable approximate inference in a probabilistic PFN. It is possible that our PFN inference algorithm may 
also have a probabilistic interpretation, but exploring the idea further is outside the scope of this paper. 
Rather, in this paper our objective is to to develop and motivate the inference and learning algorithms by 
taking a modular approach in which existing NMF algorithms are used and coupled in a way that seems to 
be intuitively reasonable. We will be primarily interested in empirically characterizing the performance of 
the proposed inference and learning algorithms on various example PFNs and test data sets in order to get 
a sense of the utility of this approach to interesting real-world applications. 

We propose general joint inference and learning algorithms for PFNs which correspond to performing 
NMF update steps independently (and therefore potentially also in parallel) on the various factorization 
equations while simultaneously enforcing coupling constraints so that variables that appear in multiple 
factorization equations are constrained to have identical values. Our empirical results show that the proposed 
inference and learning algorithms arc fairly robust to additive noise and have good convergence properties. By 
leveraging existing NMF multiplicative update algorithms, the PFN inference and learning algorithms have 
the advantage of being straightforward to implement, even for relatively large networks. Sparsity constraints 
can also be added to a module in a PFN model by leveraging existing sparse NMF algorithms. We note that 
the algorithms for performing inference and learning in PFNs should be understandable by anyone with a 
knowledge of elementary linear algebra and basic graph theory, and do not require a background in probability 
theory. Similar to existing NMF algorithms, our algorithms arc highly parallel and can be optimized to take 
advantage of parallel hardware such as multi-core CPUs and potentially also stream processing hardware 
such as GPUs. More research will be needed to determine how well our approach will scale to very large or 
complex networks. 

The remainder of this paper has the following structure. In Section 2, we present the basic PFN model. 
In Section 3, we present an example of how a DPFN can be used to represent a transition model and 
present empirical results. In Section 4, we present an example of using a PFN to model sequential data 
with hierarchical structure and present empirical results for a regular expression example. In Section 5, 
we present a target tracking example and provide results for synthetic observation data which serve to 
illustrate the interesting properties of PFNs and motivate their potential usefulness in applications such as 
music transcription, source separation, and speech recognition. We show how a target process characterized 
by a hierarchical state transition model can be represented by a PFN. Our results illustrate that a PFN 
which is defined in terms of a single target observation can then be used to effectively track the states of 
multiple simultaneous targets in the observed data. In Section 6 we present results for an example in which 
meaningful hierarchical features are extracted from a spectrogram. Such a hierarchical representation could 
be useful for music transcription and source separation applications. In Section 7, we propose a DPFN for 
modeling the sequence of words or characters in a text document as an additive factored transition model of 
word features. We also propose slightly modified versions Lee and Seung's update rules to avoid numerical 
stability issues. The resulting modified update rules arc presented in Appendix A. 
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2 Positive Factor Networks 

In this section, we specify the basic data model and present a graphical representation. We then propose 
inference algorithms for solving for the hidden variables and learning the model parameters. 

2.1 Model specification 

We now specify a data model for a set of non-negative continuous vector- valued variables {xi : i = 1, . . . , N} 
where the dimension of each Xi, in general, can be distinct. We will refer to the set {xi} as the model 
variables. We assume that a subset Xe of the model variables is observed and the rest of the variables 
comprise the hidden subset, Xh- 

The model is specified by a system of Q of non-negative factorization equations where the j'th equation 
is given by: 



X f(J,0)=^2 W k X fU,k) 

fc=i 



w( wi 



--W j x j 



wl 



x fUA) 

X fU,2) 

x fU,Pi) 



(2.1) 



where Pj > 1 for all j G {1, . . . , Q} and all Wl are non-negative matrices with a possibly distinct number 
of columns for each k. The function /(j, k) maps each (j, k) above into a corresponding index i G {1, . . . , N} 
and satisfies f{j,k) = f(m,n) implies j ^ m. Thus, £/(j,fc) refers to one of the model variables Xi, and it 
is possible for a given variable Xi to appear in multiple equations above but only at most once in any given 
equation. The matrix W J is defined as the horizontal concatenation of the non-negative Wl matrices. 

Since there are no constant terms in the above equations, the system corresponds to a a homogeneous 
system of linear equations, subject to a non-negativity constraint on the variables {x^, which can be written 
as: 



Ay = , where y ■■ 



X\ 

XN 



>0 



(2.2) 



Note that A will contain both negative and positive values, however, since all of the terms are grouped 
together on one side. It follows that our model satisfies the linearity property subject to non-negativity 
constraints (non-negative superposition property): if y± and y^ are solutions to the system, then aij/i -Va-iVi 
is also a solution, for any choice of scalars a,\ > 0, «2 > 0. 



2.2 Graphical representation 

We now develop a graphical representation to facilitate the visualization of the local linear relationships 
between the model variables in the various factorization equations. Hidden variables correspond to shaded 
nodes in the graph, and observable variables correspond to unshaded nodes. We construct the graphical 
model such that the j'th factorization equation corresponds to a subgraph with £/(j,o) ^ the child node and 
the {xf(j.k) '■ k = 1, . . . , Pj} as the parent nodes. Thus, a child node variable is a linear function its parent 
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nodes in the subgraph. We then arrive at the complete graphical model by superimposing the subgraphs 
corresponding to the various factorization equations. We allow the same variable Xj to correspond to the child 
node in multiple subgraphs (i.e., distinct factorization equations), provided that its parents do not overlap 
between the subgraphs. That is, we do not allow a single arc to correspond to multiple linear relationships. 
We annotate arcs with dash marks where necessary, to disambiguate subsets of parent nodes that correspond 
to the same factorization equation. Arcs annotated with the same number of dash marks connecting a child 
node to its parents denote a subset of parent nodes that corresponds to a single factorization equation. The 
pseudocode in Algorithm 1 outlines a procedure for creating a graphical model representation from a system 
of factorization equations. 

Algorithm 1 Create a directed graph from a system of factorizations. 
for j = 1 to Q 

jj for each factorization equation Xf(jo) = Sfeii ^k x f(j,k) 

Create a node for the corresponding variable x/(j,o)j if it does not already exist. 

dashCount <— 1 + maximum dash count on any existing arc from a parent node of £/(j,o) to #/(j,o)- 

for k = 1 to Pj 

if a node corresponding to Xf(j,k) does not already exist 
Create a node corresponding to Xfu^) 
if Xf(j^k) is observed 

Shade the node corresponding to Xfu y k) 
end 
end 

Create a directed arc from the a;/(j,fe) node to the £/(j,o) node. 
Annotate the arc with the number of dash marks given by the current value of dashCount. 



end 
end 



Annotate the arc with W[ 



As an example, consider the set of variables {x\, x 2 , • • ■ , ^12} that are described by the following system 
of factorization equations: 

x x =Wix b + W 2 x 6 (2.3 

x 2 =W 3 x 6 + W A x 7 (2.4 

x 2 =W b x B + W 6 x g (2.5 

x 2 =W 7 x 10 (2.6 

x 3 =W s x 10 (2.7 

x 3 =W 9 x n (2.8 

xa =W w xn (2.9 

x 6 =W llXl2 (2.10 

£io =Wi 2 xi 2 (2.11 

£11 =W 13 x 12 (2.12 

Figure 1 shows the graphical model associated with the above system of factorizations. Note that nodes x 3 
and X4 are connected by a dashed line. This is an optional annotation that specifics that the corresponding 
connected variables have forced factored co-activations, or simply forced co-activations. So far the only 
constraint we have placed on the parameter matrices {Wj} is that they be non-negative. However, in some 
models we might wish to place the additional constraint that the columns {w n } of Wi are normalized in 
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Figure 1: A graphical model corresponding to the example system of factorizations in Equations 2.3 - 2.12. 

some way. For example, we might consider requiring that each w n have column sum = 1. Consider the 
subgraph corresponding to Equations (2.8) and (2.9). The dashed line connecting X3 and X4 specify that 
these variables are factorized in terms of a common parent (xu in this case) and that the columns of the 
corresponding {Wi} (Wg and W10 in this case) are normalized to have unit sum. This constraint ensures 
that these three variables will have equal column sums. Thus, if any one of these variables is observed in the 
model, then we can infer that all three must have the same column sum. An activation of the parent X\\ 
implies that its children x 3 and X4 will then be activated with the same column sum (activated together), 
hence the term forced co-activations. 

Consider the subgraph corresponding to Equation (2.3). An activation in x\ (i.e., x\ is nonzero) could 
be explained by just one of the parents being nonzero. Now consider the subgraph consisting of x-i and its 
parents, corresponding to Equations (2.4), (2.5), and (2.6). In this case, an activation of x-i corresponds to 
at least one of Xg and x-j being activated, at least one of x$ and xg being activated, as well as x 10 being 
activated. 

We borrow the notion of a plate from the probabilistic graphical models formalism [8] , in which plates are 
used to represent replicated random variables. A plate consists of a box that is drawn around the replicated 
variables, with the replication count specified in a corner. Figure 2 shows an example of using the graphical 
plate notation. This graphical model corresponds to the following system of factorization equations: 



x\ 



--W lX { 
--W lX \ 



'N 



-W lX 



Letting X 1 



<: X N ] , and letting X 2 = [ x\ 



system of factorization equations more compactly as the matrix equation: 



(2.13) 
x% ] , we can then write the 



X 1 = W X X 2 



6 



(2.14) 
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Figure 2: An example of using plate notation, corresponding to -/V observable variables {x}} and N hidden 
variables {x 2 }. This model corresponds to standard NMF. 

Note that this corresponds to standard NMF, since the observed X 1 matrix is factored as the product 
of two non-negative matrices. 

2.3 Algorithms for Inference and Learning 

Typically, a subset of the model variables Xe = {xE n '■ n = 1, . . . , Ne} is observed, and we are interested in 
solving for the values of the hidden variables Xh = {xn n '■ n = 1, . . . , Nh}, which we refer to as the inference 
problem. We are also interested in solving for the values of the model parameters 9 — {W 1 , W 2 , . . . , VF^}, 
which we refer to as the learning problem. The observed variables may deviate from the modeling assump- 
tions and/or contain noise so that an exact solution will not be possible in general. We will thus seek an 
approximate solution by optimizing a reasonable cost function. 

Given a subset of observable variables and the model parameters, we define the inference solution as 
the values of the hidden variables that minimize some reasonable cost function g{9, Xh,Xe). That is, we 
consider Xe and 9 to be fixed and solve for Xh'- 



X H = argmmg(9,X H ,X E ) 

X H 



(2.15) 



The learning problem corresponds to solving for 9 given observations Xe ■ The joint learning and inference 
problem corresponds to minimizing g(9, Xh, Xe) jointly for 9 and Xh given Xe'- 



(9,X H ) =a.Tgmmg(9 7 X H ,X E ) 

e,x H 



(2.16) 



The cost function should have the property that its value approaches zero as the approximation errors 
of the system of factorization equations approach zero. One possibility is the following function, specified as 
the sum of the squared approximation errors of the factorization equations (2.1): 
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i(e,x H ,x E )=J2\\ x fu.°)- wixJ \\ 2 



= X>/w,o)-w' 

3 = 1 



x fUA) 

X fU,2) 

x f{J,Pj) 



(2.17) 



Other possibilities could involve using the generalized KL-divergence (A. 3), for example. However, we 
will not develop inference and learning algorithms by directly optimizing any particular cost function, and so 
will not be too concerned with its exact form. Rather, we propose algorithms that seem intuitively reasonable 
and for which we have observed good empirical convergence properties on test data sets, but do not offer 
any proof of convergence, even to a local minimum of any particular cost function. We will only make use 
of a particular cost function in order to quantify the empirical performance of our algorithms. 

We require that the graphical model correspond to a directed acyclic graph (DAG). Since there are no 
cycles, the graph can be arranged so that it flows from top to bottom. We then rename each of the model 
variables {x n : n = 1, . . . , N} to x\ where I = 1, . . . , L denotes the level of the variable (vertical index), and 
i is its position within the level (horizontal index). We then have a graph with L levels, such that the top 
level nodes (level L) have no parents and the bottom level nodes (level 1) have no children. A level is defined 
such that no pair of nodes with a parent-child relationship arc allowed to be in the same level. That is, for 
any pair of variables (x\, Xj), i ^ j in the same level I, we disallow that x\ is the parent of Xj or vice versa. 
If an arc exists between a higher level variable and a lower level variable, then the higher level variable must 
be the parent of the lower level variable. That is, we require that for any two variables (x™, a;™), s.t .m > n, 
x™ cannot be a child of x™. For example, the graph in Figure 1 corresponds to an L = 3 level graph with the 
following renamed variables. Variables (xi, . . . , X4) would be renamed to (x\, . . . , x\). Variables (X5, . . . , in) 
would be renamed to (x\ , . . . , x^). Variable X12 would be renamed to x\. 

The inference and learning algorithm will require a set of local variables {v J k }. We use the term local 
variable because a given vi is associated only with the j'th factorization equation, unlike a model variable 
x fU-k) which is allowed to appear in multiple factorization equations. Specifically, a distinct v k will be 
associated with each allowable combination of j and k that appears in the factorization equations in (2.1). 
Thus, several distinct v k may be associated with a given model variable and this will be the case when a given 
model variable appears in more than one factorization equation. Replacing the Xfrj,k) with the associated 



'k- 



we then have Q factorization equations Factor System — {eqj : j = 1, . . . , Q} where equation eqj is given 



by: 



vi=j2w>i 



fe=i 



w{ w£ 



wk 



=W J v 3 
We say that the above equations are in a consistent state if each model variable Xfij k \ 



(2.18) 



and all of 



its associated local variables {v J k } have the same value. Otherwise, we say that the equations are in an 
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inconsistent state. Note being in a consistent state does not imply that the corresponding Xf(j^) actually 
constitute a solution to the system. 

The basic idea of our approach is to learn a generative model of data by itcrativcly performing inference 
and learning in a bottom-up pass through the network and then perform data generation through activation 
propagation in a top-down pass. In the inference and learning pass, parent node values (activations) and 
parameters are updated based on the values of their child node variables. These inference and learning 
updates are performed locally using NMF algorithms. Once the top level nodes have been updated, we 
then propagate values downward to the lowest level nodes in a data generation pass. This is performed by 
computing the new child node values as the value of the right hand side of the corresponding factorization 
equations in which they appear. Throughout this process, the multiple vj, variables that correspond to 
a single model variable a;/(j.fc) are repeatedly replaced by their mean value in order to put the system of 
factorization equations back into a consistent state. This process of a bottom-up inference and learning step 
followed by a top-down value propagation (data generation) step is iterated until convergence. 

Algorithm 2 and the corresponding procedures in Algorithm 3 and Algorithm 4 show the pseudocode for 
the basic inference and learning algorithm that was used to obtain all of the empirical results in this paper. 
We start by initializing the hidden variables Xh to small random positive values. We then make the system 
consistent by copying the value of each model variable Xfu t k) hito each of its corresponding local variables 
v J k . The only distinction between hidden and observed variables from the perspective of the learning and 
inference algorithm is that model variables in the observed set Xe are never modified in the algorithm. In 
computing the mean values of the variables, the new mean for the model variables is computed only from 
the subset of the local variables that were updated in the corresponding inference or value propagation step. 
After performing the value propagation step, the updated values of all lower level variables in the model are 
a function of the top level variables. 

When a parameter matrix W is tied across multiple vector factorization equations, the corresponding 
equations can be merged into a single matrix factorization equation, as we did in going from (2.13) to (2.14). 
This merging will also be possible in the dynamic models that we present starting in Section 3. In this 
case, the learning update of the upStepQ procedure is simply modified to perform a single NMF left update 
instead of multiple left up steps. Appendix C describes a similar inference and learning algorithm in which 
the mean values of the variables are computed differently. 

Algorithm 2 Perform inference and learning 

Initialize hidden variables to random positive values 

// Main loop 

repeat 

// Bottom-to-top inference and learning 
for I = 1 to L - 1 
upStep(Z) 
averageParents(Z) 
end 

// Top-to-bottom value propogation 
for I = L — 1 downto 1 
downStep(l) 
averageChildren(Z) 
end 
until convergence 



Note that the only distinction the algorithm makes between Xe and Xh is that the values of the Xe 
variables are not updated during inference. If we wish to make some previously hidden variables observed or 
vice versa, is is then trivial to modify the algorithm to handle this by simply enabling or disabling updates 
on those variables. 
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Algorithm 3 upStepQ and downStepQ procedures 



// Using the values of the child variables at level I, update the values of the parent variables and 

// update the parameter matrices by performing NMF update steps. 

// Let Xi denote the set of model variables {x[, x l 2l . . . , x l LevelCount } corresponding to the level I nodes. 

// Let Factor Systemi denote the subset of the factorization equations from (2.18) 

// {eqj : such that Vq in eqj corresponds to an x\ G X{\. 

II Let duplications etC hild(i, I) — {j : eqj G Factor Systemi and Vq corresponds to x\}. 

upStep(7) 

for each j G Factor Systemi 
if learning is enabled 

Learning update: Using v 3 Q = IU J V, perform a left NMF update on W^ , using, e.g. (A. 7) 
end 

Inference update: Using v J Q — W 3 v 3 , perform a right NMF update on v J , using, e.g. (A. 6) 
end 
end 

// Using the values of the parent variables, update the values of the level I child variables by performing 

// value propagation. 

downstep(Z) 

for each j G Factor Systemi 

If Perform value propagation 
W j v j 



end 



end 
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Algorithm 4 averageChildrcn() and averageParents() procedures 

// Let Xi denote the set of model variables {x\, x l 2 , ■ ■ ■ , x l LevelCount } corresponding to the level I nodes. 

// Let duplicationCountChild(i,l) — \duplicationSetChild(i,l)\. 

II Let duplication Set(i, I) — {(j, k) : eqj G Factor System and v k corresponds to x\}. 

II Update the value of each x\ G X\ as the mean value of the corresponding v J 

II that appear in Factor System^ Then set all v k corresponding to x\ to this value as well. 

averageChildren(Z) 

for i = 1 to LevelCounti 

If For each (child) variable x\ G X[ 

if x\ is hidden 

mean Value = duplicationCountC hiid{i,i) l^jedu P iicationSetchiid{i,i) v o 
for each (j,k) G duplication Set(i, I) 

vi *— meanValue 
end 

x\ <— meanValue 
else if x\ is observed 

for each (j,k) G duplication Set(i, I) 



end 



x\ 



end 
end 
end 

// Let Xi + i denote the set of model variables {x 1 -^ 1 , x 1 ^ 1 , . . . , x l ^ velCount } corresponding to the level / + 1 nodes. 

// Note that it is possible for a level I variable to have some or all of its parents in a level higher than I + 1, in 

// which case the set X/ +1 will only represent a proper subset of the parents of level I. 

If Let duplications 'etParent(i, I + 1) = {(j, k) : eqj G FactorSystemi and v 3 k , k > 1 corresponds to x^ }. 

// Let duplicationCountParent(i, I + 1) = \duplicationSetParent(i, I + 1)|. 

// For each x' +1 G Xi + i, update the corresponding v k , k > 1 to their mean value. 

averageParents(7) 

for i = 1 to LevelC ounti + i 

II For each (parent) variable x t +1 G Xi 



if x t +1 is hidden 

duplicul ionCou n i Pa ri n!(i,l + l) ^—^ 



duplicationCountParent(i,l + l) Z—t(j,k)£duplicationSetParent(i,l+l) k 



for each (J, k) G duplications et{i 1 1 + 1) 



v k *— meanValue 
end 

x t + <— meanValue 
else if xv~ is observed 



for each (J, k) G duplication Set{i, I + 1) 



end 
end 
end 
end 



vi - x \ +i 
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We also note that typically, some kind of normalization of the parameter matrices W will be performed 
during the left NMF update steps. For example, one could normalize the column of W to have unit sum. 

Note also that inference and learning are performed jointly. We can perform inference only by simply 
disabling the learning updates. If a subset of the parameters 9 are known, then the learning updates can be 
disabled for those parameters. 

The computational cost of performing a single iteration of the inference and learning algorithm is given 
by the sum of the costs of performing the learning and inference NMF updates and the value propagation 
multiplication on each factorization equation. The number of iterations to convergence can depend on several 
factors, such as the longest path in the graph, NMF algorithm employed, etc. 

3 Factored sequential data models 

We now consider PFNs for modeling sequential data, which we will refer to as dynamic positive factor 
networks (DPFNs). A DPFN provides for the modeling of a data sequence as an additive combination of 
realizations of an underlying process model. Analogously to a dynamic Bayesian network (DBN) [10], a 
DPFN is created by horizontally replicating a PFN subgraph. Thus, a DPFN is simply a PFN that has 
a regular repeating graphical structure. The corresponding parameters are also typically replicated. We 
only require that the particular subgraph that is replicated correspond to a valid PFN. We will refer to the 
subgraph that is replicated as corresponding to a time slice. Although we use terminology that implies a 
time series, such an interpretation is not required in general. For example, the data could correspond to a 
biological sequence, or the characters or words in a text document, for example. 

Our models will make use of an non-negative linear state representation. Intuitively, one can think of the 
model as supporting the simultaneous representation of any additive combination of allowable realizations 
of an underlying state transition model, or more generally, an underlying dynamic process model. A state 
variable in the model is defined such that the dimensionality of the variable corresponds to the number of 
possible states. We allow these variables to be general non-negative vectors, so that the non-zero-valucd 
components of the variable represent a (positive) weight or strength of the corresponding state. Any given 
non-zero valued state variable then corresponds to some superposition of states. A given pair of state 
variables in adjacent time slices is then factored in terms of a set of basis vectors that specify the allowable 
transitions, and a corresponding encoding vector. The factored state representation seems to allow significant 
representational power, while making use of a compact parameter set. 

In this section we present a factored state transition model and present empirical results to illustrate the 
performance of the inference and learning algorithms. 

3.1 Model 

Consider a transition model with M states and R possible state transitions. Figure 3 shows an example state 
transition diagram for a 4-state automaton. This transition model contains M = 4 states, labeled S±, . . . , S± 
and R = 6 state transitions, labeled t\, . . . , t%. 

Figure 4 shows the DPFN that we will use to model a process that evolves according to the transition 
model given in Figure 3. This DPFN corresponds to a system of T — 1 factorization equations where for 
each t € 1, . . . , T — 1 we have: 



Xt 

Xt+l 



= Wht 



w 2 



h t (3.1) 



Since the same parameter matrix W appears in each equation, the above system of T — 1 vector factor- 
ization equations can be expressed as a single matrix factorization equation: 
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Figure 3: A state transition diagram for a 4-state automaton. 




Figure 4: The DPFN corresponding to the factorized state model in Equation (3.6). The first four time 
slices are shown. The state at time slice t is represented by Xt- We place constrains on states at adjacent 
time slices such that the states x t , Xt+i are represented as an additive combination of the allowable state 
transition vectors. h t represents the encoded transitions for x t ,x t+ i. 
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xi x 2 x 3 

x 2 x 3 X A 



XT -I 
X T 



= W[hi h 2 h 3 ... hr-i ] 

[hi h 2 h 3 ... h T -i ] (3.2) 

Let us define x Ct as the vertical concatenation of any two adjacent state variables Xt, Xt+\ so that: 



W 2 



x t 



(3.3) 



We define the 2M x T-l matrix X c as the following horizontal concatenation of the time-ordered x Ct 
vectors: 



_A C — y %c\ 



*L/ (-■!-, <xj p 



X\ x 2 x 3 
x 2 x 3 x 4 



Xct — 1 

XT-1 
X T 



(3.4) 



We define the R x T-l matrix H as the following horizontal concatenation of the time-ordered {h t } 
vectors: 



H 



h\ h 2 h 3 



h 



T-l 



(3.5) 



Using the above matrix definitions, we can then write the matrix factorization equation (3.2) more 
compactly as: 



X r = WH 



(3.6) 



We now illustrate how this DPFN can be used to represent the transition model in Figure 3. We let 
vector xt £ M M represent the model state at time t. The sequence, {x t : t — 1, . . . , T} is then modeled 
as a realization of the transition model specified in the state transition diagram. The parameter matrix W 
specifies the transition model, and the vector h t represents an encoding of (ir t ,x t+1 ) in terms of the basis 
columns of W. 

The parameter matrix W is constructed directly from the state transition diagram as follows. For each 
state Si, we define a corresponding state basis vector s, G K M such that the i'th component is 1 and all other 
components are zero. States Si, . . . , S4 then correspond to the following state basis vectors, respectivley: 



«i 



" 1 " 




' " 




" " 




" " 






,s 2 = 


1 




,s 3 = 




1 


,.s 4 = 





















1 



(3.7) 



For each transition tk in the state transition diagram that represents a transition from state Si to state 
Sj, we define the corresponding transition basis vector Wk as the vertical concatenation of the corresponding 
state basis vectors Sj on top of Sj. Each of the R transitions tk will then have a corresponding 2M x 1 
transition basis vector wj. given by: 



w k 
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For example, transition t 2 i n the diagram, which represents a transition from state S 2 to state S3 has a 
corresponding transition basis vector w 2 given by: 



w 2 







" " 




1 







S 2 
S 3 


= 









1 








(3.9) 



Let the 2M x R transition basis matrix W be defined as the horizontal concatenation of the transition 
basis vectors {wi}. Any ordering of the columns is possible. The columns of W then specify the allowable 
transitions in our model: 



W = [ 



tui w 2 w 3 



wr 



(3.10) 



We will find it useful to partition W into upper and lower sub-matrices W\ and W 2 such that the M x R 
upper sub-matrix W\ represents the time slice t — 1 state basis vectors and the M x R lower sub-matrix W 2 
represents the corresponding time slice t state basis vectors. One possible W corresponding to the transition 
diagram in Figure 3 is given by: 

W = [ W\ W 2 W3 W4 w 5 w@ ] 

W 1 

w 2 

si s 2 s 3 s 4 s 3 s 2 

S 2 S3 S 4 Si S3 s 4 



(3.11) 
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1 








1 



Consider the following state sequence of length T = 10, which is a valid sequence under the example 
transition diagram: 



se 9i — (Si, S 2 , S3, S4, Si, S 2 , S3, S4, Si, S 2 ) 



(3.12) 



Now suppose we construct a sequence of state variables {xt : t = 1, . . . , T} corresponding to this state 
sequence by setting x t equal to the corresponding state basis vector s^ for state Si at time t: 



xi = si 



, x 2 = s 2 



, x 3 = S3 



15 



, x w = s 2 



(3.13) 
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Let us define the M x T matrix X as the horizontal concatenation of the state variables Xt as: 



X = [ Xi x 2 x 3 ... xt ] 
We then have the following sequence X corresponding to the state sequence (3.12): 



(3.14) 



X = [ X\ X2 X3 X4 X5 Xq X-j X% Xo, X\Q ] 

10 10 10 
10 10 1 
0010001000 
0001000100 

Using the above X and (3.6), we then have the following model factorization: 



(3.15) 



X C =WH 



' 1 
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10 10 1 
10 10 
10 10 
10 10 
000000000 
000000000 



(3.16) 



Note that given Xq and W, only one solution for H is possible. We see that each pair of state vectors 
x Ct in X c corresponds to exactly one transition basis vector in the resulting factorization, since the value 
of each xt was chosen equal to one of the Si and X corresponds to a valid state sequence. The columns of 
H then given an encoding of X in terms of the basis vectors of W so that exactly one component of each 
column h t has value 1 , corresponding to the transition that explains x Ct . We say that a sequence X of state 
vector observations form an elementary state sequence if there exists a factorization Xq = WH such that 
each column of H contains at most one nonzero component. 

Now let us return to the case where the Xt is no longer constrained to be equal to one the state basis 
vectors. We only require Xt to be a non-negative vector in K M . Although W is specified in the example above 
as containing only and 1 valued components, in general we only place the constraint that the matrices of 
the factorization equation be non-negative. Note that since the model is a PFN, the linearity property from 
Section 2.1 holds, so that any additive combination of solutions is also a solution. That is, any additive 
mixture of state sequences that arc valid under the transition model of W arc rcprcsentablc. We can also 
see that this property follows directly from (3.6). For non-negative scalars a, (3, we have: 



X Ca =WH a 
X Cb =WH b 
<S> aX Ca + (3X bc =W(aH a + (3H b ) 



(3.17) 



For the case where multiple components of Xt are positive, the interpretation is that multiple states are 
simultaneously active at time t. For example, consider an example where Xt = (a, 0,0, f3) T . That is, at time 
t, the model is in state Si with magnitude a and is simultaneously in state S± with magnitude [3. Suppose 
that x t+ i is hidden. We can see from the factorization equation above that the solution is x t +i — ((3, a, 0, 0) T . 
That is, our factored transition model specifies that if the model is in state Si with magnitude a at time t 
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Figure 5: A deterministic state transition diagram for a 4-state automaton. 

then the model must be in state S2 with magnitude a at time t + 1. Likewise, if the model is in state S4 
with magnitude (3 at time t then the model must be in state S\ with magnitude j3 at time t + 1 . Due to the 
factored state representation, the model can represent both realizations simultaneously. A sequence of state 
vectors is modeled as an non-negative linear combination of the allowable transition basis vectors. 

Note also that since we do not impart a probabilistic interpretation, all allowable outgoing transitions from 
a given state can be considered equally likely For example, suppose x t +\ is hidden and that x t = (0, a, 0, 0), 
corresponding to the model being in state S2 with magnitude a at time t. Since there arc multiple outgoing 
transitions from S2 (£2 and ig), multiple solutions for Xt+\ are possible. We will explore the issue of performing 
inference when multiple solutions are possible in the following results sections. 

A learning and inference algorithm for this network can be obtained by applying Algorithm 2 to the 
system of factorizations (3.6). Example pseudocode is presented in Appendix D. 

3.2 Empirical results 

In this section, we perform inference and learning on the dynamic network in Figure 4, for two distinct 
transition models, using synthetic data sets. We present results for the cases of fully and partially observed 
input sequences. We also present results for sequences that have been corrupted by noise. We first consider 
the case where the underlying transition model is known. We then present results for the case where the 
transition model is learned from training data. 

3.2.1 Fully observed input sequences under a known transition model 

We start with a simple deterministic transition model, and will later revisit the earlier nondctcrministic 
model from Figure 3. Figure 5 shows the state transition diagram for a deterministic model that is obtained 
by removing transitions £5 and t$ from the earlier nondctcrministic model. Using the previously outlined 
procedure, we obtain a parameter matrix W corresponding to this transition diagram as follows. First we 
construct a transition basis vector Wk corresponding to each transition t^\ 

—(2) — (S)—=(S)- — (^)- (3 - 18 > 

The transition basis matrix W is then constructed as the horizontal concatenation of the transition basis 
vectors (again, the column ordering is arbitrary): 

W = [ W\ W2 U>3 Wi 1 (3.19) 

We start by considering the case where the state sequence X is fully observed, and the transition sequence 
H is hidden. We then wish to infer the values of H given the observed X. Suppose we wish to specify an 
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Figure 6: An image plot of the observed X from Equation 3.15, corresponding to the state sequence: 
(Si, S2, S3, S4, Si, S2, S3, S4, Si, S2). Row i of X corresponds to state S,;. Column i corresponds to the i'th 
time slice. Here and throughout this paper we use the "hot" color map shown on the right, so that components 
with a minimum component value are displayed in black, components with the maximum component value 
are displayed in white, and values in between are displayed in shades of red, orange, and yellow. 

input sequence X that corresponds to the following sequence of states: (Si, S2, S3, S4, Si, S2, S3, S4, Si, S2), 
which is a valid state sequence under the transition diagram in Figure 5. This sequence can be represented by 
setting X = a [ S\ s-i S3 S4 S\ S2 S3 S4 Si S2 ] , where a is any positive scalar. For example, 
the choice a — 1 results in: 



X 



10 10 10 
10 10 1 
0010001000 
0001000100 



(3.20) 



We will be dealing with matrices that are non-negative, typically sparse, and such that many or all of 
the non-zero components will take on the same value. For visualization purposes, the particular value will 
typically not be relevant. For these reasons we find that, rather than simply printing out the numerical 
values of a matrix as above, an image plot can be a more visually appealing way to take in the relevant 
information. In the remainder of this paper, we will therefore make extensive use of image plots of the 
matrices that appear in the various PFN factorization equations. In displaying the image plots, we make use 
of the "hot" colormap, so that zero-valued components appear black, small values appear red, larger values 
yellow, and the maximum valued componcnt(s) in a matrix appear white. Figure 6 shows an image plot of 
the observed X from above. 

Recall that the network in Figure 4 corresponds to the matrix factorization Equation (3.6), which we 
reproduce here: 



Xr 



WH 



(3.21) 



Recall also that Xc is defined as the vertical stacking of adjacent columns of X and so only H is unknown 
(hidden). We solve for the values of H by applying the inference algorithm from Appendix D, which is a 
special case of the general learning and inference algorithm specified in Algorithm 2. In solving for H, 
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Figure 7: A plot of the matrices of the factorization Xc — WH, after solving for H Note that the upper half 
submatrix of X c is equal to X, which corresponds to the state sequence: (Si, S2, S3, S4, Si, S2, S3, S4, Si, S2). 
Here, W corresponds to the deterministic state transition diagram in Figure 5 

this corresponds to first initializing H to random positive values and then iterating the inference algorithm 
(learning is disabled since W is known) until convergence. For all empirical results in this paper, hidden 
variables are initialized to random positive values uniformly distributed between and 1CP 6 . Each iteration 
of the inference algorithm in this case corresponds to performing an NMF right update step, e.g., using 
(A.6). 

Figure 7 shows a plot of the matrices in the factorization, where H is shown after convergence of the 
inference algorithm. For this input sequence, a few hundred iterations was sufficient for the RMSE (root 
mean squared error between X and the reconstruction WH) to drop below 10~4. The inference computation, 
implemented in Java and Python, took approximately 1 second to run on a desktop PC with a 3 GHz Core 
2 Duo processor, which is representative of the time required to run each of the examples in this section. 

We now add some noise consisting of uniformly distributed values in [0,0.1] to X. Figure 8 shows the 
resulting estimate for H . The noisy observations can no longer be represented exactly as a linear combination 
of the transition basis vectors, yielding an RMSE of 0.034. We observe empirically that the approximate 
factorization found by the NMF updates appears to be relatively insensitive to small amounts of additive 
noise, and produces an inference result that appears visually similar to applying additive noise to the previous 
noiseless solution for H . 

We now consider an example where the observation sequence corresponds to a mixture of realizations of 
the underlying transition model. Recall that the if observation sequence X a and the hidden sequence H a are 
one solution to the model, and observation sequence Xt, and the hidden sequence Hi, are another solution to 
the model, then the observation sequence X = X a + Xt, and the hidden sequence H = H a + Ht, are also a 
solution of the model. As an example, consider the observation sequence X = X a + Xj,, where X a and Xt, 
are given by: 
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Figure 8: A plot of the matrices of the factorization Xc = WH, where the observations X are corrupted 
by additive noise. The inferred hidden transition sequence H is shown after convergence of the inference 
algorithm. Here, W corresponds to the deterministic state transition diagram in Figure 5 



X a =0.5 [ 8\ s 2 s 3 s 4 Si s 2 S3 s 4 Si s 2 ] 
X b =1.0 [ s 3 s 4 Si s 2 s 3 s 4 s x s 2 s 3 s 4 ] 



(3.22) 



Figure 9 shows an image plot of the observed sequence X = X a + Xjj. Note that both sequences X a 
and Xb correspond to a valid realization of the underlying transition model of the network. From the 
superposition property, we know that the sum sequence X has a corresponding hidden sequence H which 
satisfies the factorization Equation 3.21 with equality. The expressiveness of this network is such that any 
observations and hidden variables corresponding to any non-negative superposition of valid realizations of 
the underlying transition model specified by W are representable. This does not necessarily mean that 
our particular choice of inference algorithm will always be able find the corresponding solution, however. 
We do observe, though, that for this particular network choice, repeated runs of our inference algorithm 
always converged to an exact solution (specifically, inference was stopped when the RMSE dropped below a 
certain threshold: 10~4). Figure 10 shows an image plot of the inferred H along with the other matrices in 
Equation 3.21. 



3.2.2 Partially observed input sequences under a known deterministic transition model 

We now consider the case where the sequence X is partially observed so that we are given the values of Xt 
for some subset of time slices, and the values of x t for the remaining time slices are hidden. We would then 
like to infer the values of all hidden variables in the model, which will consist of H along with the hidden 
subset of {x t }- 

Suppose that X represents a sequence of length 10 where the first time slice X\ is observed and corresponds 
to state Si. That is, x\ — as\. We arbitrarily choose a — I. We then wish to infer the values of the future 
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Figure 9: An image plot of an observation sequence X = X a + Xf, consisting of an additive combination of 
two sequences X a and Xb, which are each a realization of the underlying network transition model. Here, 
X\ begins in state S\ in the first time slice, and corresponds to the orange components since it has half the 
magnitude of sequence Xf,. Sequence X\, begins in state S3 and corresponds to the white components. 
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Figure 10: An image plot of the matrices of the factorization Xq — WH, where the observations are an 
additive combination of two sequences: X = X a + X&. H is shown after convergence of the inference 
algorithm. Here, W corresponds to the deterministic state transition diagram in Figure 5 
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Figure 11: A plot of the estimate for X after various iteration counts of the inference algorithm. The top 
image shows the initial X which only has the first time slice (time index 1 in the figure) set to state 5*1 (i.e., 
x\ = si). The future time slices (a; 2 . . . Xio) are initialized to small random values, which appear black in the 
figure since their maximum value of 10~6 is small compared to the largest value of 1 in the figure. The next 
lower figures show the X estimates after 10, 50, and 500 iterations, respectively. We see that 500 iterations 
is sufficient to effectively reach convergence. 

time slices of X as well as the values of H . That is, we wish to perform prediction. We again use the same 
inference algorithm and initialize all hidden variables to small positive random values. 

Figure 11 shows the initial X and the estimates after 10, 50, and 500 iterations of the inference algorithm. 
We see that convergence is effectively reached by 500 iterations. Figure 12 shows the corresponding plots 
of the matrices of the network factorization equation after 500 iterations. We observe that the inference 
estimates for X appear to converge outward in time from the observed time slice x\. The local variable 
averaging step of the inference algorithm causes the transition model to propagate a positive value one time 
slice forward and backward for each iteration. Note also that given the observed time slice, there is only one 
possible solution for the hidden variables (and that our inference algorithm successfully finds it) since the 
transition model is deterministic. 

Although the first time slice x\ was chosen as the only observed variable above, we could have just as 
easily chosen any subset of time slices (or even any subset of components) of X and H as the observed subset 
Xe- For example, Figure 13 shows the estimates of X for various iteration counts for the case where the 
time slice x^ is observed and all other time slices are hidden. As expected, we observe that the estimates 
propogate outward (both backward and forward) from x^ with increasing iteration count. 

3.2.3 Partially observed input sequences under a known nondeterministic transition model 

In this section we perform experiments to observe how the inference algorithm performs when multiple 
solutions for the hidden variables are possible. We will employ a nondeterministic state transition model 
and supply partial state observations so that multiple solutions will be possible for at least some of the 
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Figure 12: An image plot of the matrices of the factorization Xc = WH, after solving for H and the hidden 
time slices (xi . . . xio) of X. Here, W corresponds to the deterministic state transition diagram in Figure 5 

hidden states. We will now use the nondeterministic state transition diagram in Figure 3, which corresponds 
to the transition basis matrix W from Equation 3.11. 

We first perform an experiment to verify that the inferred values for the hidden transitions H are 
correct for a short fully observed sequence of observations X. Figure 14 shows an image plot for X = 
[ si S2 S3 S3 S4 si S2 S4 Si S2 1 , which represents a valid sequence of state transitions under 
the transition model in Figure 3. We then perform inference on H using the algorithm from Appendix D, 
which is a special case of the general learning and inference algorithm specified in Algorithm 2. The inference 
results are shown in Figure 15 and correspond to the correct factorization. 

We now consider a partially observed sequence of state observations for X . Let 



X 



«2 



? S 3 S 4 ? ? ? ? ? 



(3.23) 



where the "?" state vectors denotes a hidden state. We initialize H and all hidden variables in X to 
small positive random values and run the inference algorithm to convergence. Figure 16 shows X before and 
after performing inference. An exact factorization is found and is shown in Figure 17. We observed that 
repeated runs always converged to an exact solution, although different (but valid) result was obtained each 
time. Observe that some of the hidden variables correspond to exactly one state, while others correspond to 
a superposition of states, such that the column sums are equal. We do not explicitly normalize the columns 
to have equal column sums in the inference procedure, but rather, the equality follows from the choice of W 
and the observed variables. Looking at the transition diagram in Figure 3, we see that the inferred variables 
with a single state component (i.e., x t : t £ {1,3,6, 7}) occur at the time slices for which there is exactly one 
possible state that satisfies the transition model, given the observed variables in X. For example, time slice 
2 has an observed state x-i — S2, corresponding to 5*2. From the transition diagram we see that the only 
state that can transition to S^ is Si and therefore the hidden state at time slice 1 must correspond to state 
Si. The variables with multiple distinct state components (i.e., Xt,t £ {8,9, 10}) correspond to time slices 
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Figure 13: A plot of the estimate for X after various iteration counts of the inference algorithm. The top 
image shows the initial X in which x^ is observed and all other variables arc hidden. Here wc let X4 correspond 
to state S\ by setting x± = s\. All other variables (i.e, (xi, . . . , X3), (#5, . . . , Xio), (hi, . . . , ft-io)) are initialized 
to small positive random values. We see that convergence is effectively reached by 500 iterations. 
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Figure 14: An image plot of X corresponding to the state sequence: (Si, S2, S3, S3, S4, Si, £2, £4, Si, S2). 
This is a valid realization under the transition diagram in Figure 3. 
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Figure 15: An image plot of the matrices of the factorization Xc — WH, after solving for H. 
Note that X is equal to the upper half submatrix of Xc and corresponds to the state sequence 
(Si, S2, S3, S3, S4, Si, S2, S4, Si, S2). Here, W corresponds to the nondeterministic state transition diagram 
in Figure 3 
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Figure 16: The top image plot shows X after initialization, corresponding to the partially observed state 
sequence: (?,>S2,?, S3, S4, ?,?,?,?,?), where "?" denotes a hidden state. The bottom image plot shows 
the inference results for X in which the hidden variables have now been estimated. Observe that variables 
Xi,Xa,Xe, £7 correspond to exactly one state and variables xs,xg,x\o correspond to a superposition of states, 
since multiple solutions are possible for Xs,Xg, x\q. 



for which there are multiple possible transitions that satisfy the transition model. We can think of these 
time slices simultaneously being in multiple distinct states. For example, a valid solution for hidden state 
Xg, can consist of any additive combination of state vectors S3 , S4 such that the column sum of ccg is 1 . 

Suppose that we now make the final time slice observed by setting xio = S2, corresponding to the partially 
observed sequence (?, 1S2, ?, 5*3, S4, ?, ?, ?, ?, 82)- In this case, there is only one possible configuration of hidden 
states that is constant with the observed states and with the underlying transition model. Figures 18 and 19 
show the corresponding inference results. We observed that repeated runs of the algorithm always converged 
to the correct solution. 

In the case where multiple valid solutions are possible for the hidden variables, it is interesting to consider 
modifying the inference algorithm to attempt to find sparse solutions. One possibility that is simple to 
implement and seems to provide robust solutions consists of replacing the standard NMF update steps of 
the basic inference algorithm specified in Algorithm 2 with corresponding sparse NMF update steps. For 
these experiments, we chose to use the nonsmooth non-negative matrix factorization (nsNMF) algorithm [5] 
which is described in Appendix B for reference. Using the partially observed sequence from Equation 3.23 
and the modified inference algorithm with an nsNMF sparseness value of 0.1, we obtain the results shown 
in Figures 20 and 21. Repeated runs of the algorithm appear to produce distinct solutions such that each 
inferred hidden state Xt has only a single nonzero component, corresponding to a single active state in any 
given time slice. For sparseness values much lower than 0.1, we observe that a solution consisting of a 
superposition of states can still occur, but most of the weight tends to be concentrated in a single state for 
each time slice. 

We now perform an experiment to see what will happen if all model variables are made hidden (i.e., X 
and H are initialized to positive random values), and we perform inference using the sparse NMF updates. 
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Figure 17: An image plot of the matrices of the factorization Xc = WH, after solving for both H and the 
hidden variables of X. 

The top plot of Figure 22 shows X after being initialized to random positive values uniformly distributed 
between and 1. The bottom plot shows the result of one run of the inference algorithm using a nsNMF 
sparseness value of 0.1. Figure 23 shows the resulting factorization, which converged to an exact solution. We 
observe that repeated runs of the inference algorithm produce factorizations corresponding to a randomly 
sampling of the underlying transition model. This interesting result only appears to occur when we use 
the sparse inference algorithm. The non-sparse algorithm also converges to an exact factorization, but the 
inferred X and H then tend to consist of a superposition of many states in any given time slice. 

3.2.4 Learning the transition model from training data 

We now attempt to learn a transition model from training data. The training data will consist of an observed 
sequence X, and H will be hidden. The model parameters W are therefore unknown and will be learned from 
the observed sequence X. We first consider the case where the input sequence X is an elementary sequence, 
so that any two adjacent state vectors Xt,Xt+\ can be represented by a single transition basis vector in W. 
That is, any state vector Xt contains only one nonzero component and therefore corresponds to exactly one 
state in the transition model. We let X consist of a sequence of state basis vectors, such as: 



X=[s x 



S2 S 3 S 3 



(3.24) 



where the states are constrained to evolve according to the transition diagram in Figure 3. The training 
sequence is generated as follows: We choose the initial state x\ randomly with uniform probability. When 
multiple outgoing transitions from a state are possible, the next state vector is chosen randomly with equal 
probability from the set of allowable transitions. For all experiments in this section, we use a training 
sequence length of 1000. 
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Figure 18: The top image plot shows X after initialization, corresponding to the partially observed state 
sequence: (?, S2, ?, S3, S4, ?, ?, ?, ?, 52), where "?" denotes a hidden state. The bottom image plot shows X 
again after running the inference algorithm to infer the hidden states. In this case, there is only one possible 
configuration of hidden states that is constant with the observed states and with the underlying transition 
model. 
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Figure 19: An image plot of the matrices of the factorization Xc = WH, after solving for H. This is for 
the case where X corresponds to the paritally observed sequence: (?, 52,?, S3, S4, ?, ?, ?, ?, S2). In this case, 
there is only one possible configuration of hidden states that is constant with the observed states and with 
the underlying transition model. 
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Figure 20: The top image plot shows X after initialization, corresponding to the partially observed state 
sequence: (?, S^,?, S3, S4, ?,?,?,?, ?), where "?" denotes a hidden state. The bottom image plot shows X 
again after running the inference algorithm to infer the hidden states. Sparse NMF with sparseness = 0.1 was 
used. We observe that multiple solutions are possible (since X10 is now hidden), but the sparsity constraint 
leads to solutions in which only 1 state is active in any given time slice. 



M0 



©2008 by Brian K. Vogel 



3.2 Empirical results 



3 FACTORED SEQUENTIAL DATA MODELS 



Partially observed A" f 



Inferred // 




■] 6 

CcLmn 



Figure 21: An image plot of the matrices of the factorization Xq = WH, after solving for H . Note that X 
is equal to the upper half submatrix of Xq and corresponds to the state sequence (?, S2, ?, S3, S4, ?,?,?,?, ?). 
Sparse NMF with sparseness = 0.1 was used. We observe that multiple solutions are possible (since ccio is 
now hidden), but the sparsity constraint leads to solutions in which only 1 state is active in any given time 
slice. 
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Figure 22: The top image plot shows X after being initialized to random positive values. The bottom image 
plot shows X again after convergence of the inference algorithm, using sparse NMF with sparseness = 0.1 



Recall that inference and learning is performed jointly, so that in the process of learning W, we also end 
up with an estimate for H (the transition activations for the training sequence). The column count of W 
is a free parameter that must be specified before performing learning. Since we already know the transition 
model that was used to generate the training data, we know that we must specify at least 6 columns for W in 
order to have any hope of achieving an exact factorization. We will therefore specify that W have 6 columns. 
We then run the inference and learning algorithm from Appendix D to convergence. Figure 24 shows the 
inference and learning results on a length 10 sub-sequence of X. Observe that the learned W contains the 
6 correct transition basis vectors. However, we have observed that it can sometimes take a few runs of the 
algorithm in order to find an exact factorization. When the column count of W was then increased slightly to 
8, we found that the inference and learning algorithm always converged to an exact factorization. Figure 25 
shows the inference and learning results for a W with 8 columns. Note that although the transition basis 
vectors are correct, there are now some duplicate columns in W . This is due to the normalization step in 
which the columns of W are normalized to have unit sum as part of the NMF left (learning) update step. 

We have observed that the number of duplicate columns in W can be reduced if we use sparse NMF 
updates in the inference and learning algorithm and also remove the column sum normalization step from 
the NMF left update step. We instead normalize each column of W so the the upper and lower sub-columns 
(corresponding to W\ and W2) are constrained have equal sum, which may be 0. This allows unneeded 
columns of W to tend towards zero during the learning updates. Using an nsNMF sparseness value of 0.1, 
we obtain the results in Figure 26. We observe that even though it was specified that W have two more 
columns than required to represent the transition model, the learned W contains only the required 6 columns 
and two zero- valued columns. Note also that since we have omitted the step of normalizing all columns to 
have sum equal to one, the nonzero columns of W no longer sum to 1. If we wish for the nonzero columns 
to sum to the same value (e.g., 1), we could consider adding column normalization for the nonzero columns 
as a post-processing step. We have also observed that the use of the nsNMF algorithm leads to a small 
approximation error so that the resulting factorization is no longer exact. 
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Figure 23: An image plot of the matrices of the factorization Xq = WH, after solving for X and H, which 
were initialized to random positive values. Even though all model variables are hidden, the sparse NMF 
updates caused the inference algorithm to converge to a sparse solution corresponding to an valid state 
sequence under the state transition diagram in Figure 3 in which only a single state is active in any given 
time slice. 
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Figure 24: An image plot of the matrices of the factorization Xc = WH , after learning W and solving for 
H, which were initialized to random positive values. W was constrained to have 6 transition basis vectors. 
The training sequence X consisted of state basis vectors that evolved according to the transition diagram in 
Figure 3. 
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Figure 25: An image plot of the matrices of the factorization Xc — WH, after learning W and solving for 
H , which were initialized to random positive values. W was constrained to have 6 transition basis vectors. 
Since there are two more columns than necessary to represent the transition model, observe that there are 
two duplicate columns in the learned W. The training sequence X consisted of state basis vectors that 
evolved according to the transition diagram in Figure 3. 
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Figure 26: An image plot of the matrices of the factorization Xq = WH, after learning W and solving 
for H. W was constrained to have 6 transition basis vectors. There are two more columns than necessary 
to represent the transition model, but we remove the normalization step from the W update and set the 
nsNMF sparseness value to 0.1. The learning algorithm then learns only the 6 required columns. That is, 
we see that two columns of the learned W are zero- valued. 
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Figure 27: An image plot showing the first 10 time slices of the length 1000 training sequence X consisting 
of an additive mixture of 3 elementary state transition sequences. 

We now consider the more challenging learning problem in which the training sequence consists of an 
additive mixture of realizations of the transition model in Figure 3. We construct the training sequence X 
as the following additive mixture of three scaled elementary state transition realization sequences: 



X = 0.5 * Xelemi + 1.0 * Xelerri2 + 1.5 * Xelem^ 



(3.25) 



where the Xelenii are independent elementary state sequences, each consisting of a sequence of state 
basis vectors that conform to the transition model in Figure 3. Figure 27 shows the first 10 time slices of 
the training sequence. Note that each time slice of X now corresponds to a state vector that represents 
a superposition of three states. Given this training sequence, we perform inference and learning using 8 
columns for W. The algorithm converges to an exact factorization, which is shown in Figure 28 and we see 
that the underlying transition model was learned, even though the training data consisted of an additive 
mixture of elementary state sequences. Thus, even though a realization of the transition model was not 
presented individually as training data, the model still was able to learn an underlying transition model that 
explained the mixture training sequence. 

We now add a noise component to the training sequence so that: 



X = 0.5 * Xelemi + 1.0 * Xelem-i + 1.5 * Xelem^ + e 



(3.26) 



where e is a 4 x L matrix of uniformly distributed noise between and 0.1. Figure 29 shows a portion of 
the noisy training sequence. The inference and learning algorithm converges to the approximate factorization 
shown in Figure 30. We see that the underlying transition model was still recovered in W, although with 
a small error. These experiments indicate that it is possible to learn an underlying transition model, given 
only noisy training data corresponding to an additive mixture of realizations of the underlying model. 
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Figure 28: An image plot of the matrices of the factorization Xc — WH , after learning W and solving for 
H where the X training sequence consists of an additive mixture of three state transition sequences. W was 
specified to have 8 transition basis vectors. The underlying transition model used to generate the training 
data corresponds to the transition diagram in Figure 3. 
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Figure 29: An image plot showing the first 10 time slices of the length 1000 training sequence X consisting 
of an additive mixture of 3 elementary state transition sequences and a noise component. 

4 Hierarchical factored state model 

In this section we present DPFNs for modeling sequential data with hierarchical structure. The general 
DPFN framework allows for the representation of dynamical models with complex hierarchical structure. 
We present multilevel hierarchical networks in which only certain combinations of upper level and lower level 
variables and state transitions may be simultaneously active. We show how such networks provide for the 
modeling of factored hierarchical state transition models. 

The models that we present in this section can be thought of as two copies of the network from Figure 
Figure 4 to represent multiple levels in a hierarchy along with an additional coupling module to enforce 
desired coupling constraints between pairs of variables within a time slice. As a concrete example, we 
present a hierarchical state model for a regular expression and present empirical results. 

4.1 Model 

Consider a 2 level network in which each level represents a state transition model. One can think of each 
of the levels as having a corresponding state transition diagram. The active state transitions in the 2 
levels are then coupled so that only certain combinations of level 2 and level 1 state transitions may be 
simultaneously active. Figure 31 shows a 2-level network for modeling a factored hierarchical state model 
with two levels in the transition hierarchy. The model immediately extends to an arbitrary number of levels. 
The level 2 state variables are denoted by {x^ : t = 1, . . . ,T} and the level 1 state variables are denoted 
by {x\ : t = 1, . . . , T}. The level 2 transition variables {/if : t = 1, . . . , T — 1} are coupled to the level 1 
transition variables {h\ : t = 1, . . . , T — 1} via the coupling variables {v t : t = 1, . . . , T — 1}. 

Note the symmetry between levels 1 and 2. One can think of the level 1 and level 2 transition models as 
executing independently except for where the coupling constraints forbid it by constraining the combinations 
of level 1 and level 2 state transitions that can be simultaneously activated. Any two consecutive time slices 
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Figure 30: An image plot of the matrices of the matrices of the factorization Xc = WH, after learning W 
and solving for H where the training sequence X consists of an additive mixture of three state transition 
sequences plus a noise component. W was specified to have 8 transition basis vectors. The underlying 
transition model used to generate the training data corresponds to the transition diagram in Figure 3. 
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Figure 31: A DPFN with a coupled 2-level transition model. The level 2 state variables are denoted by x\ 
and the level 1 state variables are denoted by x\. The level 2 transition variables h\ are coupled to the level 
1 transition variables hi . The first four time slices of the T time slice network are shown. The dashed lines 
represent (optional) forced factored co-activations, which are enforced if all sub-columns of the parameter 
matrices are normalized to have equal column sums. 
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(t,t+ 1) of Figure 31 correspond to the following three factorizations: 
For level 1, we have: 



r rA ' 
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(4.1) 



For level 2, we have: 



H+i 



W z h 



2,2 



w 2 

W 2 



(4.2) 



For the level 1 to level 2 transition coupling, we have: 
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(4.3) 



For T time slices, we then have the following three matrix factorization cuqations: 
For level 1, we have: 
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(4.4) 



which we can write more concisely as: 



Xl =W X H X 
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H l 



(4.5) 



For level 2, we have: 
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(4.6) 



which we can write more concisely as: 



X 2 =W 2 H 2 



W 2 

w 2 



H 2 



(4.7) 
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For the level 1 to level 2 transition coupling, we have: 
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Vt-i \ , we can then write the above factorization more concisely as: 



H 2 

H l 



=UV 



u 2 



V 



(4.9) 



The columns of U define the coupling basis. The positive components of each column ui specify compo- 
nents of h 2 and h\ that can be co-activated. That is, the stacked vector of h\ and h\ is representable as a 
non-negative combination of the columns of U, with v t giving the encoding in terms of the basis columns. 
A learning and inference algorithm for this network can be obtained by applying Algorithm 2 to the system 
of factorizations (4. 5), (4. 7), and (4.9). Example pseudocode is presented in Appendix E. 

We now consider a regular expression example similar in spirit to the one used by Murphy in [9] . Murphy 
showed how a Hierarchical hidden Markov model (HHMM) [11] could be used to model the hierarchical 
transition model corresponding to a regular expression. In this section, we show how a regular expression 
can be modeled in a somewhat analogous manner by using DPFN with hierarchical structure. We stress that 
there are many significant differences, however. For example, in a HHMM, each possible state transition is 
assigned a probability, which is not the case here for the DPFN. Also, in a HHMM, the hidden states are 
discrete-valued, whereas they are non-negative continuous-valued here. Figure 32 shows a state transition 
diagram for the DPFN in Figure 31 that models the regular expression "a+b(de)*c(de)+". Under this 
transition model, an exact state factorization is only possible if the input sequence satisfies the given regular 
expression. This regular expression specifies that we have one or more repetitions of "a" followed by "b" 
followed by zero or more repetitions of "de" followed by "c" followed by one or more repetitions of "de." We 
allow the expression to repeat after reaching the end state. Observe that states S 2 and S 2 of the top level 
transition model (which we refer to as the level 2 transition model) refine to the same lower level transition 
model (the level 1 transition model). Some states (the ones with letters under them) are production states 
that correspond to the production of the corresponding letters. The initial state is S 2 , and states S% and S3 
denote explicit ends states. Although explicit end states are used here, they are not needed in general, as 
we will sec in the next section. The semantics is such that when a transition to a refining state, such as Sf 
or S5 occurs, the refinement executes while the parent remains in the same state. We will choose the basis 
vectors for the coupling activations U so that at any time step, the sum of active state values in level 2 is 
equal to the corresponding sum of active state values in level 1. For example, the transition t 2 correspond 
to a transition from S 2 to S 2 in level 2 and a simultaneous transition from the off/ end state to the starting 
state Si in level 1. The next transition must correspond to a transition from Si (production of "d" to 5^ 
(production of "e"), while level 2 remains in the parent state S3. The hierarchical transition diagram places 
implicit constraints on transitions that may simultaneously occur (co-occur) in levels 1 and 2. For example, 
the level 2 transition t 2 can co-occur with the level 1 transition t\. An implicit self- loop transition from S 2 to 
S 2 can co-occur with level 1 transitions t\ and t\. Figure 33 shows the same transition diagram, annotated 
to include the implicit transitions. Table 1 shows the transition coupling constraints that specify the pairs of 
transitions in levels 1 and 2 that can co-occur. For example, note that when the refinement is not producing 
a "d" or "e" the end state remains in a self-loop. 

The level 1 transition model, level 2 transition model, and coupling constraints from the state transition 
diagram directly map into the corresponding parameters W 1 , W 2 , and U as follows. The W 1 and W 2 matri- 
ces are constructed following the procedure outlined in Section 3.1. Each w\ is the vertical concatenation of 
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Figure 32: A state transition diagram for the regular expression "a+b(de)*c(de)+". 




Figure 33: The state transition diagram for the regular expression "a+b(de)*c(de)+". The implicit transi- 
tions are shown in blue. 
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Table 1 : Transition coupling constraints for the example hierarchical transition model. Each column specifies 
a level 2 transition and a level 1 transition that can co-occur. 
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the two state basis vectors associated transition t\. W 1 is then constructed as the horizontal concatenation 
of the transition basis vector columns {w}}. We then have: 
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Likewise, W 2 corresponds to the upper sub-matrix of W 2 and W$ corresponds to the lower sub-matrix. 

We construct the transition coupling matrix U as follows. We start with the Q pairs of level 2 - level 1 
couplings specified in Table 1 . We construct U from this table such that the z'th column of U corresponds to 
the transition pair in the i'th column of the table. For the level 2 transitions, we replace each transition t 2 
in the table with the column vector u upper . e R R2 that contains a 1 in the component corresponding to the 
row of H 2 that activates the corresponding transition in W 2 . Likewise, for the level 1 transitions, we replace 
each transition ti in the table with the column vector ui ower . G M. R1 that contains a 1 in the component 
corresponding to the row of H 1 that activates the corresponding transition in W 1 . We then form the j'th 
column Uj of U as the vertical concatenation of u upper . on top of ui ower . so that we have for the (R2 + Rl) 
x Q matrix U: 



U ■■ 
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,,2 



U 2 «3 



upper i 
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loweri 



"uppers 
,1 



U Q 



"uppers 
,,1 



"upperQ 
lower-Q 



(4.14) 



where the R2 x Q submatrix U\ is given by: 
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II u 



^upperi ^upper-2 ^upper^ 

and the Rl x Q submatrix Ui is given by: 



^upperQ 



(4.15) 
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From Table 1, we then have the following for U: 
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(4.17) 



Note that U 1 activates H 2 (activations for W 2 ) and U 2 activates H 1 (activations for W 1 ) 
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4.2 Empirical results 

In this section, wc perform inference on the network in Figure 31 using the parameter matrices for the regular 
expression hierarchical transition diagram in Figure 33. We use parameter matrices W , W 2 , and U from 
Equations (4.10), (4. 13), and (4.17) respectively. 

We choose to model a length 10 sequence for ease of diplay of the results. We let the observed set Xe 
consist of {x\ = s\,x 2 = s 2 }. We let the hidden set X^ consist of all other model variables. That is, we 
are only given that in time slice 2, the level 2 state is S 2 (production state "b") in Figure 33 and that in 
time slice 7, the level 2 state is 5*1 (production state "c"). We are given no information about the level 1 
states, transitions, or coupling variable states. However, even given this limited information, the constraints 
impose by the hierarchical transition model result in a single possible solution for the hidden variables in 
time slices t = 1, . . . , 9. Multiple solutions (due to multiple outgoing transitions) are possible for time slice 
10. The reader can verify this by plugging in the observed states and studying the transition diagram to 
trace the patch of allowable transitions. 

We now perform inference using an instance of the general inference and learning algorithm given in 
Appendix E. All hidden variables are initialized to random positive values, and the algorithm is run for 500 
iterations, which is sufficient for convergence to an exact factorization (RMSE less than 10~4) for all three 
factorization equations. We observed that repeated runs always converged, although a different solution for 
time slice 10 was reached in each case (since multiple solutions are possible for this time slice). 

Figure 34 shows both the initialized and inferred values for X 1 and X 2 . Model variables x\ — s\ and 
x 2 = s 2 were observed and all other model variables were hidden and estimated by the inference algorithm. 
Note that the hidden values of X 2 (all time slices except t = 2,7) appear black in the figure because the 
initialized random values are very small (order of 10^6) compared to the maximum value of the observed 
time slices in the figure. The inference results for X 1 and X 2 are shown in the rightmost image plots. We 
sec that the inference algorithm successfully found the single possible solution for time slices t = 1, ... ,9. 
In time slice 10 there are two possible solution states for both x\ and x 2 and we see that the inferred 
values represent a mixture of the two possible state configurations. That is, in time slice 9, the inferred state 
is Xg = s 2 and x\ — s\. According to the transition diagram in Figure 33, we see that in the next time 
slice (slice 10), we can either remain in level 2 state S 2 while level 1 transitions to S\, or both level 2 and 
level 1 can transition to their respective end states: S 2 and S\. We observe that the inferred state values 
represent a superposition of these two possibilities. We also note that when sparse NMF updates were used 
in inference, the solution tended to chose one alternative or the other, but not a superposition of the two. 

Figure 35 shows the inference results for Xq and H 2 in the factorization Xq = W 2 H 2 in the top three 
image plots. The bottom three image plots show the inference results for X^ and H 1 in the factorization 
Xq = W 2 H l . Figure 36 shows an image plot of the inference results for the matrices in the transition 
coupling factorization. 

5 Multiple target tracking model 

We now consider a multiple target tracking model that is motivated by the problem of tracking auditory 
objects in computational auditory scene analysis (CASA) [12]. This example serves to motivate the potential 
usefulness of our approach to problems such as CASA. Recall that a DPFN supports the representation of an 
additive combination of realizations of underlying dynamical process models. Such a representation seems 
to fit very naturally within CASA problems, since the sound pressure waves that arrive at the eardrum 
are well modeled as an additive mixture of scaled versions of the sound pressure waves produced by each 
auditory object alone. If it turns out to be possible to use a DPFN to model auditory objects such as musical 
instruments, environmental noises, speech, etc., then the linearity property would imply that any additive 
mixture of the various supported auditory objects at various sound levels would then be representable by 
the model as well. Of course, this does not imply that any particular inference and/or learning algorithm 
will successfully be able to deal with such a mixture, but the fact that an additive mixture is representable 
at all seems to be quite powerful. Our approach has the desirable property that nowhere in our model do 
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Initial partially observed. \' 



X" after inference 




Figure 34: The initialized X 2 is shown in the upper left. Model variables x\ = s\ and x 2 = s 2 were observed 
and all other model variables were hidden and estimated by the inference algorithm. The initialized X 1 is 
shown in the lower left. All hidden variables were initialized to random positive values. The inference results 
for X 1 and X 2 are shown in the rightmost image plots. We note that given the two observed variables, only 
one possible solution exists for time slices t = 1, . . . , 9. Multiple solutions exist for the final time slice £ = 10. 
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Figure 35: The top three matrices show the inference results for Xq and H 2 in the factorization Xq = W 2 H 2 . 
The bottom three matrices show the inference results for Xq and H 1 in the factorization X^ = W 2 !! 1 . Model 
variables x 2 and x 2 were observed and all other model variables were hidden and inferred by the inference 
algorithm. Parameter matrices W 1 and W 2 are given in Equations (4.10) and (4.13), respectively. 
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Figure 36: An image plot of the inference results for the matrices in the transition coupling factorization. 
All variables H 1 , H 2 , V were hidden and inferred. Parameter matrix U is given in Equation (4.17). 

we need to explicitly specify anything about the maximum number of simultaneous targets/objects or the 
allowable magnitude range (volume levels) for targets. 

We will specify a dynamical model for a single target, and then observe how a scaled mixture of multiple 
targets is then automatically supported. This illustrates how the factored representation of a dynamical sys- 
tem allows for a compact parameter set, while still being expressive enough to support multiple simultaneous 
scaled realizations of the underlying process model. All parameters in the target model will be manually 
specified here for ease of illustration, although in principle one could attempt to learn them from training 
data as well. We therefore perform inference only in the results that follow. We will also work with synthetic 
target observation data, rather than actual audio recordings, again for ease of explanation and illustration. 
It is hoped that after reading this paper, the interested reader will find it straightforward to come up with 
ideas for developing more complex extensions of these models that may then be applied to practical problems 
involving real-world data sets. 

The problem setup is as follows. We are given a sequence of non-negative observation column vectors, 
one for each time slice. When concatenated horizontally, they form an image such that the time axes runs 
from left to right. The vertical axis then corresponds to sensor position or frequency bin index, depending on 
the interpretation. We therefore use position and frequency bin interchangeably in the following discussion. 
In the context of CASA, each observation vector can be considered a time slice of a magnitude spectrogram 
such that the components of an observation vector represent the short-time spectra of an audio signal. In 
the CASA interpretation, we can then think of the observation image as a magnitude time frequency image, 
such as a spectrogram. 

The observations may also be potentially corrupted by additive non-negative noise. Our goal is to infer 
the types and positions of all targets given only the target observation sequence. We may also be interested 
in predicting future target trajectories and/or inferring missing observations data. 

Specifically, we wish to model a target that has the following properties: 
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1. A new target can come come into existence (become alive) at any time instant. 

2. Each target has a certain localized pattern signature. That is, its position/frequency evolves in time 
according to an underlying transition model. Target classes are distinguished by their position signa- 
tures. For example, one can think of two distinct musical instruments: one produces a certain type of 
vibrato regardless of the note pitch, while the other instrument produces a short chirp followed by a 
more steady pitch. 

3. A target's pattern signature is modeled as being independent of the target's overall position/location/frequency 
bin index. Thus, we model a target's location as a position shift or frequency shift of the pattern sig- 
nature. We only allow a target's overall position to change at certain constrained points in the position 
signature transition model. In the context of CASA, a the target's overall location might correspond to 

the pitch of a note played by an instrument, which is independent of the instrument's pattern signature 
(e.g., each note is performed with a vibrato). 

4. A target may cease to exist (die) after some number of time slices which may or may not be determin- 
istic. 

5. When a target comes into existence, it can take on any positive magnitude value, which remains 
constant during the time that the target is alive. This corresponds to an auditory object, such as an 
instrument note event, that can begin sounding at an arbitrary sound level or loudness and remains 
at the same sound level until the note ends. It is possible to relax this constraint, but we enforce it in 
this example. 

6. Any number of targets may be simultaneously present, each potentially having a distinct magnitude. 
This corresponds to multiple auditory objects simultaneously occurring, such as several instruments 
performing together, or several people talking at the same time, for example. 

5.1 Model 

Figure 37 shows the DPFM for the target tracker. x\,t — 1...T are observed and all other variables are 
hidden. We now describe the role of each of the 7 factorization equations that specify this model. We make 
use of the matrix notation developed in Section 3.1 so that the factorizations may be expressed concisely. 

5.1.1 Target state transition model 

The target state transition model corresponds to the factorization: 



Xt =W 6 H 3 






H 3 (5.1) 



Figure 38 shows the target state transition diagram. Observe that states Sf through S§ correspond to 
a "A-type" target and states S% through Sg correspond to a "B-type" target. The initial state when a 
"A-type" target comes into existence is Sf, and the minimum duration (i.e., target lifetime) is 5 time slices, 
since the target must pass consecutively through states Sf through S$ before it is possible for the target to 
die (i.e., transition to the 0/off state). We see that when the current state is Sf the following state can be 
5| or the state. Recall that this is not a probabilistic model. Thus, multiple outgoing transitions from a 
given state can be thought of as transition possibilities that are equally likely. The particular transition (or 
superposition of transitions) that is chosen during inference depends on the values of the observed variables 
as well as any other constraints imposed by other parts of the model that are coupled to the transition model. 
The state transition model for a "B-type" target is similar but only consists of 4 states. We construct W 6 
directly by inspection of the state transition diagram using the procedure described in Section 3.1. Note 
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Figure 37: The DPFN for the the multiple target tracker. The target observations variables x\,t = 1...T are 
observed and all other variables arc hidden. The first two time slices arc shown. 
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Figure 38: The state transition diagram for the target state variables x\. The corresponding state emission 
vectors x\ are also shown. 

that there are 9 states and 13 transitions. The "off" state is not an explicit state, and is represented by the 
9-dimcnsional vector in the transition basis matrix. Thus W 6 has size 18 x 13. The state basis vectors 
{sf} are therefore 9-dimensional column vectors, as are the target state variables {xf}. Each wf is the 
vertical concatenation of the two basis state vectors associated transition tf . W e is then constructed as the 
horizontal concatenation of the transition basis vector columns {wj}. We then have for W e : 



W 6 = [ 



w% 



'10 



11 






4 



4 



4 



5.1.2 Target label to target state coupling 

The target label to target state coupling corresponds to the factorization: 



12 



'13 



(5.2) 



A 5 
A 4 



--W 7 U 3 






U 6 



(5.3) 



This module performs the task of assigning a target type label x^ to a target state x\. From Figure 38 
we see that there are two distinct target types. We let x\ denote the 2-dimcnsional column vector such that 
the first (top) component denotes the strength of an "A-type" target and the second (bottom) component 
denotes the strength of a "B-type" target. We will choose the basis columns U such that 



We wish to add coupling constraints so that the "A-type" basis vector 
any of the basis state vectors {s\,S2, s§, s\, s§} and the "B-type" basis vector s\ 
any of the basis state vectors {s§, sf, s|, 



'3i 

2 ,21 
9J- 



(1,0) T can co-occur with 
= (0, 1) T can co-occur with 



We construct each column of W 7 as the vertical concatenation 
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of an sf on top of and Sj that can co-occur. Noting that the column ordering is arbitrary, we have for the 
7x9 matrix W 7 : 



W< 






,,3 



(5.4) 



5.1.3 Target state to target emission coupling 

The target state to target emission coupling corresponds to the factorization: 



X 4 

x ? < 



--W 6 U 2 



wl 
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u z 



(5.5) 



This module performs the task of coupling the target state variable x\ to the corresponding production 



(emission) variable xf € 
basis vector sf € 



t 
er ™ , where patternSize = 5 in this example. Figure 38 shows the emission 
' "*' - e below the corresponding state Sf. If a given state Sf is present in x\ with 
magnitude a, then we wish for the corresponding emission variable sf to also be present in x 3 with magnitude 
a and vice versa. We construct each column of W 3 as the vertical concatenation of a basis state vector sf 
on top of the corresponding emission basis vector sf. The 10x9 matrix W 3 is then given as: 



W 3 



Wf 

W$ 



(5.6) 



5.1.4 Pattern translation transition model 

The pattern translation transition model corresponds to the factorization: 



Xl =W 2 H 1 



wf 
wl 



H 1 



(5.7) 



This module represents the transition model for the amount by the target emission pattern is translated 
(i.e., target position) between time steps t and t + 1. Figure 39 shows the state transition diagram for 
the position shift variables x\ with P possible distinct position shift amounts. Note that x\ is therefore 
a P dimensional column vector, with the i'th component corresponding to the magnitude of S\ at time t. 
Each state S} corresponds to a distinct target position. The index i representing the downward translation 
amount that the target pattern will be translated when it appears in the observation vector x\. The state 
S\ corresponds to the minimum downward translation amount, so that the target emission pattern appears 
as the top patternSize components of x\ . The state Sp corresponds to the maximum downward translation 
amount, so that the target emission pattern appears as the bottom patternSize components of x\. 

We specify in the state transition diagram that a transition t oni from the off state at time t to to any of 
the P possible position shift amounts Sj at time t+ 1 is allowed. Let t on denote the set {t orH , i = 1, . . . , P} of 
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Figure 39: The state transition diagram for the position shift variables x\ with P possible distinct position 
shift amounts. 

all "on" transitions. We specify that a transition toffo from any of the P possible positions S} to the off state 
is allowed. Let t ff denote the set {t ff t ,i = 1, . . . ,P} of all "off" transitions. A self-loop transition t se if. 
from any position state Sf to itself is allowed. Let t se if denote the set {tgeif^i = 1, . . . , P} of all "self- loop" 
transitions. Finally, we also allow a transition ti nCi from state Sj to <S'L_ 1 , as well as a transition tdeci from 
state Sj +1 to Sj. Let ti nc denote the set {t inCi ,i = 1, . . . , P — 1} of all "position increment" transitions. Let 
tdec denote the set {tdecni — 1, • • • , P — 1} of all "position decrement" transitions. Let t c h ange denote the 
union of t inc and i<j ec , which is the set of all "position change" transitions. The complete set of all position 
transitions is then Tpos = {t se if,t changei t on ,t ff}. 

We construct W 2 directly by inspection of the state transition diagram using the procedure described in 
Section 3.1. Using our convention for basis state vectors, state Sj in the transition diagram corresponds to a 
9-dimensional column vector sj such that the i'th component is 1 and all other components are 0. The "off" 
state is not an explicit state, and is represented by the 9-dimensional vector in the transition basis matrix. 
Since there are P states, the state basis vectors {sj} are 9-dimensional column vectors. Each column of 
the IP x | Tpos | matrix W 2 then corresponds to the vertical concatenation of the two basis state vectors 
associated with each allowable transition in Tpos. Again, note that the column ordering of the transition 
basis vectors is arbitrary. For example, the column of W 2 corresponding to transition ti nC2 would consist of 
the vertical concatenation of sJ; on top of S3. 



5.1.5 Target state transition to position shift transition coupling 

We now place coupling constraints on the target state transition and pattern translation transition models. 
Without these constraints, it would be possible for a target to have a positive state magnitude but a zero- 
valued corresponding position magnitude and vice versa. However, we wish for these magnitudes to be 
identical for a given target. Specifically, we add coupling constraints so that a target state "on" transition 
co-occurs with a target position "on" transition, and likewise for the "off" transitions. We also constrain the 
target position value so that a pattern position change (pattern translation) can co-occur with the repetition 
transitions (i§, tf 3 ) for the target pattern, but is otherwise disallowed. 

Although it would be possible to couple the target state transitions hf directly with the target position 
transitions h\, each of the transitions if would then have couplings for each of the P positions, resulting in a 
potentially large number of coupling basis vectors. We can reduce the number of required coupled transitions 
by introducing an intermediate high-level position transition variable h 2 . The target state transition variable 

55 ©2008 by Brian K. Vogel 



5.1 Model 



5 MULTIPLE TARGET TRACKING MODEL 



Target state transition 


t 2 


t 2 


t 2 


t 2 


t 2 

t 7 


t 2 


i 2 


t 2 


High-level position transition 


^change 


tself 


^change 


tself 


ton 


ton 


f off 


toff 




Target state transition 


t' 2 


*2 


h 


t 4 


h 


h 


tio 






High-level position transition 


tself 


tself 


tself 


tself 


tself 


tself 


tself 





Table 2: Transition coupling constraints between the target state transitions h^ and the high-level position 
transitions h 2 . Each column specifies a target state transition and high-level position transition that can 



co-occur. 



hf is coupled to h 2 via the factorization: 



H ? < 
H 2 



=W 5 V 2 



W? 
WS 



V* 



(5.8) 



and h 2 is coupled to the target position transition variable h\ via the factorization: 



H 2 



--W^V 1 



Wf 

w 4 



V 1 



(5.9) 



Let h 2 correspond to a 4-dimensional column vector such that each component corresponds to a distinct 



type of position transition, 
position." We let h 2 = a 2 = 



"position turn on." We let h 2 



Specifically, we let hf — a\ — (1,0,0,0) represent "remaining at the same 
(0, 1,0, 0) T represent "position change." We let h 2 = a 3 = (0, 0, 1, 0) T represent 



04 



(0,0,0,1) represent "position turn off." 



Table 2 shows the transition coupling constraints between hf and hf . We construct the parameter matrix 
W 5 from this table as follows. We see from 5.2 that setting /if equal to the standard unit basis vector e, £ M 9 
will activate transition t 2 . We then construct the j'th column of W 5 as the vertical concatenation of eJ on 
top of cik such that e 3 and at correspond to the j'th target state transition and high-level position transition 
in the table, respectively. We then arrive at the following 9 x 15 matrix for W 5 : 



W b = 



ee e e ei 2 ei2 e 7 ei 3 e 5 
a,2 a\ a-i <X\ a 3 a 3 a 4 



en ei e 2 e 3 e 4 e 8 e 9 e w 
0,4 ai a,\ a\ a\ a\ a\ a\ 



(5.10) 



We construct W A as follows. Note that the top 4 rows of W 4 correspond to the submatrix W 4 , which 
contain the basis activations for h 2 . The bottom submatrix W^ of W 4 contains the activations for h\. 
Note that setting h\ equal to the standard unit basis vector e, G IRl T P os l activates the i'th column of W 2 , 
corresponding to a position transition in Tpos = {t 8e if, t c hangei t nit ff}- For each i € 1, . . . , \Tpos\, we add 
a corresponding column wf to W 4 as follows. If the hi = e^ activates a transition in W 2 in the set t se if, let 
wf equal the vertical concatenation of a\ on top of e^. Otherwise, if hi = e^ activates a transition in W 2 
in the set t c h a nge, let wf equal the vertical concatenation of a 2 on top of ej. Otherwise, if hi = ei activates 
a transition in W 2 in the set t on , let wf equal the vertical concatenation of a 3 on top of e». Otherwise, if 
hi = ei activates a transition in W 2 in the set t ff, let wf equal the vertical concatenation of 04 on top of 
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5.1.6 Target pattern translation coupling 

The target pattern translation coupling corresponds to the factorization: 



r x 3 ' 






X 2 


=w 1 u 1 


X 1 






' w\ ' 
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wl 






[ W 3 J 



U 1 (5.11) 

where W\ has patternSize rows, W\ has P rows, and W$ has P — 1 + patternSize rows. 

We wish for the target emission vectors to appear translated (vertically shifted) in the observation vectors. 
An observation vector x\ has dimension greater than the target emission vector x 3 so that x 3 may appear 
as a subvector in x\ . The number of components by which x 3 is translated is specified by the pattern shift 
amount x\. If component i of x 3 has value a and component j of x 2 also has value a then we wish for 
component k = % + j of x\ to also have value a. We then construct W 1 as follows. For each combination 
of i e 1, . . . , dimension(x 3 ) and j <E 1 , . . . , dimension(x 2 ) add a column q^. to W . The subcolumn of 
<7i. corresponding to W\ has only the i'th component equal to 1 and all other components 0- valued. The 
subcolumn of qi. corresponding to W\ has only the j'th component equal to 1 and all other components 
0- valued. The subcolumn of qi . corresponding to W$ has only the /c'th component equal to 1 and all other 
components 0-valued. 

Figure 40 shows an image plot of the submatrices of W 1 for the case where patternSize = 5, dimension{x 2 ) 
15, and dimension(x]) = P — 1 + patternSize — 19. Note that W 1 has patternSizeP = 75 columns. Note 
that each column in W/, W\, W\ has column sum equal to 1 so that x 3 , x 2 , and x\ are modeled as having 
equal column sums. In our model, x\ is observed and all other variables are hidden. Note that if a given 
component of x\ has some positive value, then there are typically many basis columns in W 1 that can explain 
it, meaning that many solutions for x 2 ,x 3 are generally possible. However, the other factorizations in the 
model will add enough additional constraints that a unique solution for the hidden variables can still be 
possible. 

A learning and inference algorithm for this network can be obtained by applying Algorithm 2 to the 
system of factorizations Equations (5.1) - (5.11). Appendix F shows the psuedocode for the algorithm. Since 
all parameters matrices are known, we run the algorithm in inference-only mode with learning disabled in 
the following section. 

5.2 Empirical Results 

We now present empirical results for the target tracking model. We make use of the parameter matrices 
{14 72 } presented in the previous section so that the model parameters can be considered known. We present 
inference-only results for the case where the observation sequence X 1 is cither fully or partially observed. 

5.2.1 Clean target observations 

We first consider the case where X 1 is fully observed and consists of noiseless observations. All other model 
variables are hidden and will be inferred. The bottom image plot in Figure 41 shows the target observations. 
There are three target objects represented: two A-type targets and one B-type target. Observe that there 
are two targets present during time slices t = 4,5,6,7. The targets have magnitudes of 1.0, 0.8, and 0.6. 
These target objects were chosen such that they conform to the target model and therefore are exactly 
representable under the model. We used P — 15 position values for these experiments, yielding a pattern 
shift amount from -7 to 7. That is, P — 8 corresponds to the shift amount where a pattern is in the center 
of the observation vector. 
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Figure 40: Submatrices W^, W 2 \ and W£ of W 1 
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Figure 41: Inference results for X 2 , A 3 , A 4 , X 5 . Note that only the bottom X 1 is observed and consist of 
clean target observations. All other variables are hidden and were correctly inferred. 

We used the inference algorithm described in Appendix F, which is a special case of the general inference 
and learning algorithm presented in Algorithm 2. The top four image plots in Figure 41 show the inference 
results for X 2 , X 3 , A 4 , A 5 . We note that only A 1 was observed, so that higher-level variables U 1 , U 2 , U 3 , 
H 1 , H 2 , H 3 , V 1 , and V 2 were also hidden, but are not shown here since they are less relevant. 

We found that 10000 iterations were sufficient for all factorization equations to converge completely 
(RMSE less than 10e-4), which took approximately 1 minute to run on an a PC with a 3.0 GHz Intel 
Core2 Duo CPU. Given the observed sequence, exactly 1 solution is possible. All results for this model were 
obtained by running the inference algorithm for 10000 iterations. We observed that on repeated runs, the 
inference algorithm always converged to the correct exact solution (within our RMSE tolerance). We see 
that the observed sequence A 1 was successfully decomposed into the target label (A 5 ), target state (A 4 ), 
target emission pattern (A 3 ), pattern shift amount (A 2 ), and the other higher-level hidden variables which 
are not shown for space reasons. 



5.2.2 Noisy target observations 

We now add some noise to the previous clean target observations to see how inference will perform. The 
bottom image plot in Figure 42 shows the observed noisy target observations. Noise consisting of uniformly 
distributed values in the range [0, 0.06] was added to the clean target observations from the previous section 
to form the noisy target observations X . The maximum noise amount is thus one tenth of the smallest 
target magnitude. 
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Figure 42: Inference results for X 2 , X 3 , X 4 , X 5 . 
target observations corrupted by additive noise. 



Note that only the bottom X 1 is observed and consist of 



Figure 42 shows the inference results for X 2 , X 3 , X 4 , X 5 . As before, the results for the other variables 
in the model are not shown for space reasons. We observe that the inferred values also appear somewhat 
noisy, since the model is also trying to explain the noise in the target observations. We observed that the 
visual cleanness of the results appears to degrade gradually as more noise is added. Small amounts of noise 
are tolerated quite well. A noise amount less than 0.02 or so produces visually clean results. 



5.2.3 Noisy target observations using sparse inference 

Given the previous results for the case of noisy observations, we might like for the inference algorithm to 
attempt to ignore as much of the noise as possible while still accounting for the non-noise components. We 
modify the inference algorithm so that the NMF update steps are replaced by corresponding sparse NMF 
update steps using the nsNMF algorithm [5], which is describe in Appendix B. We have observed empirically 
that the quality of the inference results seems to be improved if the sparseness parameter value is gradually 
increased during the inference procedure. For these results, the sparseness value was increased linearly from 
at iteration 5000 to 0.05 at iteration 10000. 

The input observations are generated with the same targets plus additive noise as in the previous section. 
Figure 43 shows the inference results, where the bottom subplot shows the noisy target observations for 
reference. We observe that the inference results are visually less cluttered with noise compared to the 
previous non-sparse inference case. However, we observe that the magnitudes of the inference results are off 
by approximately a factor of 2, so that some renormalization would be needed as a post-processing step. 
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Figure 43: Inference results for X 2 , X 3 , X 4 , X 5 . Note that only the bottom X 1 is observed and consist of 
target observations corrupted by additive noise. Gradually increasing sparseness was used in the inference 
algorithm. 
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5.2.4 Noisy target observations using modified sparse inference 

As an experiment, we now modify the previous sparse inference procedure so that the observation sequence 
X 1 is made to be hidden (i.e., all model variables are now hidden) during the final several inference iterations. 
The motivation for this is that after basically reaching convergence on the noisy observations data, the model 
has settled on a somewhat sparse solution. However, the noise in X 1 cannot be ignored completely, and so 
the model tries to find a configuration of the hidden variables that best explain the noise to some extent. 
However, if we stop updating X 1 on each iteration with the actual observations, the model can then be free 
to ignore the noise and converge to a sparser solution. We need to be careful, though. If the model is allowed 
to run a large number of iteration with X 1 now hidden, it may gradually "forget" X 1 and settle on some 
other configuration of variables. 

We again supply the same target observations corrupted by additive noise as the previous examples. 
Sparseness was for iterations through 5000, and then gradually increased linearly reaching a maximum 
sparseness of 0.05 at iteration 9980. For the final 20 iterations, X 1 was made hidden. This was accomplished 
by simply disabling the re-copying of the observations data into the corresponding X 1 local variable during 
inference. Thus, on each iteration, the down propagated values for X 1 would be reused on the successive up 
propagation. 

Figure 44 shows the inference results for X 1 , X 2 , X 3 , X 4 , X 5 . Note that the inference results appear 
quite clean. However, the relative magnitudes of two of the targets are not correct. Note that the lowest- 
magnitude target in the figure corresponds to an to an inferred target with with magnitude greater than one 
of the other inferred targets. 

5.2.5 Prediction 

We now consider the case where some of the target observations are hidden. We will then infer their values 
are part of the inference procedure, along with the other hidden variables. We first consider the case where 
time slice 25 of the clean X 1 from Section 5.2.1 is hidden. The regular inference algorithm was used (i.e., the 
usual non-sparse NMF updates were used). 

Figure 45 shows the inference results for X 1 , X 2 , X 3 , X 4 , X 5 . We observe that the inferred values 
for x\ 5 are the same for the case of fully observed X 1 . This makes sense since the the transition model in 
Figure 37 specifies that target state 9 (x 4 5 ) must follow target state 8 (x 4 4 ) and a pattern shift is not allowed 
during this transition (i.e., X 2 must remain in the same state). Thus, given the observed time slices of X 1 
before and after 25, there is only one possible configuration of variable values for time slice 25 that satisfies 
the model, and our inference algorithm successfully solved for it. 

We now set the last 5 time slices of X 1 (£22> a; 23> a; 24) a '25> a '26) to be hidden and run the inference 
algorithm. Figure 46 shows the corresponding inference results. We observe that the infered values for the 
final 5 time slices now correspond to a superposition of states since multiple solutions are possible. 

In performing these experiments, we observed that the inference procedure appears to be quite robust, 
since for the noiseless case, complete convergence (to a very small RMSE) was achieved on every run that 
we performed. Even in the noisy case, the noise in the inferred results appeared to increase gradually in 
proportion to the noise in the observation sequence. A possible drawback is that a large number of iterations 
were needed in order for the algorithm to converge to a solution. We have not yet explored how the number 
of iterations to convergence varies with network size or graphical structure. 

6 Sparse hierarchical sequential data model 

In this section we consider a multilevel hierarchical DPFN for sequential data in which the activations required 
in order to explain a given sequence (or non-negative superposition of sequences) become more sparse we we 
ascend the hierarchy. We then present learning and inference results on magnitude spectrogram data. 
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Figure 44: Inference results for X 1 , X 2 , X 3 , X 4 , X 5 . Gradually increasing sparseness was used. Here X 1 
was observed for the first 9980 iterations and then made to be hidden for the last 20 iterations. Note that 
since X 1 was hidden for several iteration, the inferred values for X 1 here differ from those of the actual 
observed X 1 in the bottom image plot. 
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Figure 45: Inference results for X 1 , X 2 , X 3 , X 4 , X 5 . x^ 5 was hidden and all other time slices of X 1 are 
observed. We observe that the inference results are the same as for the case of a fully observed X 1 since 
only a single solution is possible. 
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Figure 46: Inference results for X 1 , X 2 , X 3 , X A , X 5 . Time slices 22 through 26 of X 1 are hidden. We 
observe that the infered values for the final 5 time slices now correspond to a superposition of states since 
multiple solutions are possible. 
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Figure 47: A 2-level DPFN for a sparse hierarchical sequence model. 



6.1 Model 



In the sequential data models presented in the previous sections, we modeled x t as being simultaneously 
representable as a both a linear function of parent h t ~\ and another linear function of parent h t . Thus, a 
nonzero value of the child node x t implied nonzero values for all of its parents and vice versa. In this section, 
we consider an alternative network structure for sequential data in which the value of a child node x t is 
modeled as sum of linear functions of multiple parents. Under this representation, a nonzero value of a child 
node implies that at least one parent node must also have a nonzero value and vice versa. Thus, a nonzero 
value of a single parent node can explain a sequence of possibly several nonzero child node values. 

Figure 47 shows a DPFN with two layers. We note that Non-Negative Matrix Factor Deconvolution [7] 
appears to be a special case of this network. The variables are non-negative vectors such that x\ G M Mi . 
Typically, {x\ } would correspond to the observed data sequence and {x\ } would be hidden. A node x\ with 
parents x\_ x and x\ corresponds to the factorization: 



--Wixi + W 2 x 



2*t-l 



[ Wi w 2 ] 



(6.1) 



For time slices 



. x\ we then have: 



c^] = [W 1 W 2 ] 



r 2 
•'-1 



■ r 2 



T-l 



(6.2) 



where we note that Xq = 0. Denoting the right-most matrix above as X|, we can then express the 
factorization more concisely as: 



X 1 = [W 1 W 2 ] X 2 S 



■-WXi 



(6.3) 
(6.4) 



Note the duplicated variables in X 

„2 



Since x: 



appears in two consecutive columns of X|, we see that 



activating x z , will correspond to the j'th column of W\ being activated in time slice j and the j'th column 
of W 2 being activated in time slice j + 1. Thus, one can think of corresponding columns of W\ and W 2 as 
comprising a transition basis pair. 

We might consider normalizing W so that all columns sum to 1. Alternatively, we could consider nor- 
malizing W so that the sum of the corresponding columns of all the Wj sum to 1. That is, we can view the 
matrix consisting of the j'th column of each consecutive Wi as a basis sequence. 
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Figure 48: A 3-lcvcl DPFN for a sparse hierarchical sequence model. The first five time slices are shown. 

Figure 48 shows a 3-level network. A sequence of nonzero values x\ , x\ +1 , x\ +2l x\ +s in level 1 can be 
represented by two nonzero values x 2 ,x 2 +2 l n level 2, and by a single nonzero value of x\ in level 3. That 
is, a single nonzero value x\ in level i + 1 represents a pair of consecutive values in level i. If viewed in a 
generative setting, a single nonzero level 3 variable x\ can cause the level 2 variables x 2 ,x 2 to be nonzero, 
which in turn can cause the level 1 variables x\,x 2 ,x\,x\ to be nonzero. 

Suppose that the level 1 variables x\,...,x\ are observed. Provided that the parameter matrices W 1 — 
[ W\ W 2 ] and W 2 — [ W 2 W 2 ] contain sufficient basis columns to represent the possible observation 
pairs {x\,x\ +1 ), a given sequence can be represented such that only every other level 2 variable is nonzero, 
and only every 4th level 3 variable is nonzero. Thus, as we ascend the hierarchy, the activations become 
increasingly sparse. 

We represent each level 1 variable in Figure 48 as: 



x\ 



m\x\ 



W}x 2 _ 



= [ wl w\ ] 



(6.5) 



Likewise, we represent each level 2 variable as: 



r2 < 



Wix 3 t _ 2 



-W(x 

- [ wl w 2 ] 



v t-2 



(6.6) 



For a network with T time slices, we then have the following factorizations: 



wl W\ 
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(6.7) 



Note that x\ = 0. 
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The network in Figure 48 extends immediately to networks with an arbitrary number of levels, arbitrary 
numbers of parent variables, and arbitrary separation (in time slices) between consecutive child nodes in any 
given level. Consider a network with L levels and such that the variables x\ in level i + l each have p child 
nodes with a separation of q time slices between nodes. Let x\ at each time slice t be an Mj+i dimensional 
column vector and let x\ at each time slice t be M, dimensional column vector. For each time slice t and 
i £ {1, . . . , L — 1}, x\ is then expressed by the factorization: 
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(6.9) 



For a network with T time slices, we then have the following factorization equation for each ie {1, 
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x j = [ 4 



(6.11) 
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we can express the factorization more concisely as: 



^+1 ~i+l 

X 2-(p-l)q X 3-(p-l)q 



X 
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(6.13) 



X' 



W'XJ, +1 



(6.14) 



Note that X* has size M, by T. Each Wj has M i+1 columns. The matrix W l has size M t by Mj +i . The 
matrix X % g~ has size pM i+ i by T. For example, a network with L = 2 levels, j? = 3 children and q = 2 
seperation will correspond to the factorization: 
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(6.15) 



A learning and inference algorithm for a network with L levels can be obtained by applying Algorithm 2 
to the system of factorizations equations (6.14). Appendix G shows the pseudocode for the algorithm. 

6.2 Empirical results 

We now present learning and inference results on magnitude spectrogram data. We let the lowest level 
variables X 1 correspond to a magnitude spectrogram such that x\ represents time slice t of a spectrogram. 
For this experiment, we use a 15 second audio clip from Beethoven's Piano Sonata No. 8 in C minor, op. 13, 
Adagio cantabile, performed by the author. The corresponding spectrogram, with 512 frequency bins and 
622 time slices (41 time slices per second) is shown in the bottom plot of Figure 50. 

We make use of a DPFN with the following network parameters: We use a 4-level network with X 1 
observed and X 2 , X 3 , and X 4 hidden. Let x 2 £ K 50 have p 2 = 4 child nodes and a separation of q 2 = 1 
time slice. Let x 3 £ ]R 40 have p 3 = 4 child nodes and a separation of <7 3 = 4 time slices. Let x\ £ M 40 have 
p 4 = 4 child nodes and a separation of q 4 = 16 time slices. 

We use the learning and inference algorithm described in Appendix G, which is a special case of the 
general algorithm described in Algorithm 2. We start by setting X 1 to the spectrogram data values and 
initializing all other variables and model parameters W l to random positive values. We have observed that 
model convergence can sometimes be improved by performing the parameter learning update steps more 
frequently on parameter matrices that are closer to the observed data in the graphical model. In this case, 
we perform W 1 more frequently than W 2 , and perform W 2 updates more frequently then W 3 and so on. 
For each level, W % was normalized so that the sum of the corresponding columns of all the W'!j sum to 1 . 

Figure 49 Shows the observed spectrogram X 1 and inferred activations for X 2 , X 3 , and X 4 after running 
the learning and inference algorithm for 800 iterations. Note that the activations become progressively more 
sparse as we ascend the hierarchy from X 2 to X 4 , even though sparse NMF update steps were not used in 
the inference and learning algorithm. 

Figure 50 shows the actual observed spectrogram as well as the approximation to the spectrogram ob- 
tained by using only the inferred top-level activations A 4 and propagating them downward to form an 
approximation to X 1 . The resulting RMSE was 0.07. Only the first 250 frequency bins are shown since the 
higher frequency bins have negligible energy 

Figure 51 Shows the observed spectrogram X 1 and inferred activations for X 2 , X 3 , and X 4 for the case 
where sparse NMF updates were used in the inference and learning algorithm. Here, we ran the inference and 
learning algorithm for 400 iterations with sparseness, and then increased the nsNMF sparseness parameter 
gradually from to 0.2 at 800 iterations. 

Figure 52 shows the actual observed spectrogram as well as the approximation to the spectrogram ob- 
tained by using only the inferred top-level activations X 4 and propagating them downward to form an 
approximation to X 1 . The resulting RMSE was 0.2. Only the first 250 frequency bins are shown since the 
higher frequency bins have negligible energy This is for gradually increasing sparse NMF. We observe that 
the results using sparse NMF updates appear slightly more sparse than those using the non-sparse NMF 
updates. However, the reconstruction error using sparse updates is also higher. 

We feel that it may be interesting to consider applying this or a similar DPFN to problems where a 
sparse hierarchical representation of sequential data could be useful, such as compression, noise reduction, 
and transcription. 
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Figure 49: Observed spectrogram X 1 and inferred activations for X 2 , X 3 , and X 4 . Sparseness was 0. 
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Figure 50: Actual vs approximated spectrogram X 1 for the sparseness case. 
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Figure 51: Observed spectrogram X 1 and inferred activations for X 2 , X 3 , and X 4 . Sparseness was gradually 
increased to a value of 0.2 at 800 iterations. 
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Figure 52: Actual vs approximated spectrogram X 1 for the sparseness case. 
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Figure 53: A DPFN representing a transition model over word features. The observed yt denotes a word 
vector, xt denotes a word feature vector, and h t denotes a word feature transition. The first four time slices 
are shown. 

7 A DPFN for language modeling 

In this section we propose a data model for the sequence of words in a text document, but do not present 
any corresponding empirical results. In this model, we perform dimensionality reduction by extracting low 
dimensional feature vectors corresponding to each distinct word. We then make use of a transition model 
for the feature vectors. By performing dimensionality reduction to extract features, we can use a much more 
compact transition model than would be possible by operating directly on the words themselves. The idea 
of extracting low-dimensional word features in order to model sequential word data in text documents and 
make predictions of future words was used in [13]. 

Figure 53 shows the graphical model. In this model, the observed {yt : t — 1 , . . . , T} correspond to the 
sequence of words that appear in some text document. Suppose our vocabulary size is N words. Then let 
y t denote an N dimensional vector such that the t'th word i G {1, . . . ,N} in the document is represented 
by setting the i'th component of y t to 1 and all other components to zero. This model can also be used for 
prediction of future words, given a sequence of words up to the present t p by simply considering the word 
vector nodes for future time slices {yt ,yt +i, • • • } to be hidden. 

We perform dimensionality reduction to extract low-dimensional word features Xt by left multiplying 
each word vector y t by a word feature matrix B so that we have: 



x t = By t 



(7.1) 



The non-negative matrix B then represents the word feature matrix, where the i, j'th component repre- 
sents the amount of feature i contained by word j. One might consider normalizing the columns of B to have 
unit column sum. Letting Y — [ j/i 2/2 J/3 • • • Vt ] , we then have the following matrix factorization 
for the word features: 



X = BY 



(7.2) 



The factorization equations that describe this model are exactly the same as those for the model in 
Figure 4 with the additional above factorization for the word features in terms of the word vectors. Thus, 
the complete system of factorizations for this model can be expressed as the two matrix factorizations (3.6) 
and (7.2). 

Suupose, for example, that the word "cat" has the features "animal" and "has claws." The feature 
transition model (parametrized by W\, W2) could possibly have distinct basis transitions containing the 
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"animal" and "has claws" features. For a word such as "cat" that has both features, both basis transitions 
could be simultaneously activated due to the non-negative superposition property (Section 2.1). Thus, a 
given word sequence could be explained such that each word has possibly multiple features, and that multiple 
distinct basis transitions in the feature transition model could simultaneously be activated to explain the 
words in a sequence as an additive combination of basis feature transitions. One could then think of the 
word feature transition model as operating on the individual word features independently and simultaneously 
such that the observed or inferred word sequence is consistent with them. We feel that such an additive 
factored language model might be a powerful way of modeling word and/or character sequences. However 
preliminary empirical results suggest that additional sparseness and/or orthogonality constraints might need 
to be imposed in order to obtain meaningful solutions. 

8 Discussion 

We have presented a framework for modeling non-negative data that retains the linear representation of NMF, 
while providing for more structured representations using (non-probabilistic) graphical models. The graphical 
model provides a visual representation of the factored correlation constraints between variables. We have 
presented algorithms for performing the inference and learning that leverage existing non-negative matrix 
factorization (NMF) algorithms. Since the algorithms are inherently parallel, we expect that optimized 
implementations for hardware such as multi-core CPUs and GPUs will be possible. These algorithm have 
the advantage of being quite straightforward to implement, even for relatively complex networks structures. 
We expect that they can be understood and implemented by anyone with a basic understanding of computer 
programming, matrix algebra, and graph theory, and require no background in probability theory. 

We presented empirical results for several example networks that illustrated the robustness of the inference 
and learning algorithms. Our goal was to provide a sufficient variety of model structures so that the interested 
reader will hopefully be able to extend this work and apply it to interesting and practical applications. We 
observed that in all models in which noiseless synthetic data sets were used, learning and/or inference 
converged to an essentially exact factorization, corresponding to a correct solution for the hidden variables. 
In the case of additive noise, the quality of the solutions tended to degrade gradually as the noise level was 
increased. In particular, in Section 3.2.4 we observed the somewhat surprising result that it was possible to 
learn an underlying transition model even when the training sequence contained only an additive mixture 
of realizations of the underlying model. In our magnitude spectrogram modeling example, we observed that 
the inference and learning solution appeared to correspond to a hierarchical sparse decomposition, with the 
solution sparseness increasing as we ascend the model hierarchy As with existing NMF algorithms, however, 
there is no guarantee that convergence to a global optimum for any particular cost function will occur. We 
did observe good convergence properties in our empirical results, however. We also proposed a PFN for 
language modeling and plan to experiment with related models as future research. 

It is still unknown how well models using this framework will perform on hard real-world problems such 
as speech recognition, language modeling, and music transcription, and this will be an interesting area of 
future research. It could also be interesting to explore the use of dynamic networks that are directional in 
time, such as DPFNs with a similar state space model to an HMM or HHMM, for example. Source code for 
implementing the models in this paper will be made available by the author. 
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A Algorithms for Non-negative Matrix Factorization 

In this section we present the non-negative matrix factorization (NMF) algorithms that were used to imple- 
ment the examples in this paper. NMF is a method for constructing an approximate factorization of a matrix 
subject to non-negativity constraints. It was originally proposed by [1] as positive matrix factorization. An 
observed non-negative M by T matrix X <G IR-°' MxT is factored as : 

X w WH (A.l) 

where the M by R matrix W € R-°' MxJi and the R by T matrix H £ IR^ > flxT arc the non-negative 
factor matrices. 

NMF can be viewed as a linear basis decomposition in which the columns of X represent the observed 
variables, the columns of W represent the basis vectors, and the columns of H represent the corresponding 
encoding variables. Let Xi, i«j, and hi denote the z'th columns of X, W, and H, respectively. Let hi(j) 
denote the j'th component in the column vector hi (i.e., Hji)- Then we have : 

x t K,Whi 

= [ w 1 w 2 ■ ■ ■ w R ] hi (A. 2) 

=w 1 h i (l) + w 2 hi{2) + ■■■ + w R hi{R) 

Each column x%,i = 1...T represents an observed variable that is modeled as a non-negative linear 
combination of the R non-negative basis vectors in W. The activations of the hidden variable hi specify the 
columns of W that will combine additively to form Xi- Since only additive combinations are possible, NMF 
can be interpreted as a parts-based representation. 

[2] developed simple and robust multiplicative update rules for minimizing the reconstruction error be- 
tween X and WH using a Frobcnius norm cost function and also a generalized KL divergence cost function 
given by: 

D(X\\WH) = ^(Xijlog-^i-- - X ti + (WH) l3 ) (A.3) 

i,j ,%3 



The NMF factorization is then given by: 

maiD{X\\WH) 

W,H 

s.t. W,H >0 

From the above cost function, [2] derived the following multiplicative update rules which are guaranteed 
to converge to a local minimum of the KL divergence cost function D(X\\H): 

W T ^- 

X jjT 

W <- W® ^ H UT (A.5) 

IxH 1 

where X ® Y denotes component- wise multiplication, the division is component-wise, and lx is a matrix 
of ones of the same size as X. The W and H are typically initialized to random positive values. We will 
refer to the W update step as the learning NMF update or left NMF update step and we will refer to the H 
update step as the inference NMF update or right NMF update step in this paper. In our data sets, we have 
observed that both the Frobenius norm and the KL divergence cost functions tend to produce qualitatively 
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similar results. However, we have observed that convergence tends to be faster when using the KL divergence 
cost function. We only consider the KL divergence cost function in this paper. 

In an implementation of these update rules, one must be careful to avoid division be zero. We prevent 
division by zero by adding a small positive value e to the numerator and denominator before performing the 
division operation, where e is small compared to the largest value in the numerator or denominator matrices. 
In our empirical results sections, we used e = 0.00001. This leads to the following update rules: 

W T ^+l\f +el H 

H <- H ® w£ 1 1 — ( A - 6 ) 



W <- W ® w " +el * (A.7) 

IxH 1 +e\ w 



B Sparse NMF 



Classical NMF tends to produce somewhat sparse solutions already. However, there may be cases in which 
we wish to control the amount of sparseness. There are many algorithms for performing NMF with sparse 
factorizations, such as [3], [4], [5]. We have implemented the nonSmooth NMF (nsNMF) algorithm [5] and 
used it where we want sparse solutions since it seems to perform well and is straightforward to implement. 
However, it is possible that other sparse NMF algorithms might also perform similarly well. 
The nsNMF method uses the factorization 

X = WSH (B.l) 

where 5* £ WL R x R is a symmetric smoothing matrix defined as 

S = (1 - $)I S + ^l s (B.2) 

where Is £ M iJxii is an identity matrix and as £ R fl x R is a matrix of ones. The sparseness parameter 
8 satisfies < < 1. As described in [5], the nsNMF update rules can use the update rules from Equations 
(A. 4), (A. 5) with the following substitutions: In the W update rule, replace H with SH . In the H update rule, 
replace W with WS. However, we used the update rules from Equations (A. 6), (A. 7) instead for numerical 
performance reasons. In [5], it was proposed to normalize the columns of W to sum to 1. However, in some 
cases we have observed that this additional constraint can actually result in less sparse solutions, since this 
constraint disallows zero-valued columns of W . Note that nsNMF reduces to the standard NMF update 
rules when 9 = 0. 

C Learning and inference algorithm 2 

Algorithm 5 and the corresponding procedure in Algorithm 6 show the pseudocode for a slightly modified 
inference and learning algorithm for a general PFN that uses a slightly different way of computing the variable 
averages than was used in Algorithm 2. In computing the mean values of the variables, the new mean for the 
model variables is computed from the the complete set of local variables. Thus, after performing the value 
propagation step, the updated values of all lower level variables in the model are a function of both the top 
level variables and the values from the previous inference step. In this algorithm, when a local variable vj, is 
updated, the new mean value is computed over all of the vj, associated with x/(j,fc), whereas in the previous 
algorithm (2), the new mean value is computed over possibly only a proper subset of the v J k associated with 
Xf(j.k) (i- e -i over the v k associated with the current level I). We have observed good convergence using this 
algorithm, although it can take longer to converge than algorithm 1. We did not perform any extensive 
comparison with algorithm 1. 
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Algorithm 5 Perform inference and learning 



Initialize hidden variables to random positive values 

// Main loop 

repeat 

// Bottom-to-top inference and learning 
for I = 1 to L - 1 
upStep(Z) 
averageLevel(7) 
end 

// Top-to-bottom value propagation 
for I = L — 1 downto 1 
downStep(7) 
averageLevel(7) 
end 
until convergence 



Algorithm 6 averageLevelQ procedure 



// Let Xi denote the set of model variables {x[, x l 2l ■ ■ ■ , x l LevelCount } corresponding to the level I nodes. 

// Let F actor Systemi denote the subset of the factorization equations from (2.18) 

// {eqj : such that v 3 Q in eqj corresponds to an x\ <G X{\. 

II Let duplication Set(i, I) — {(j, k) : eqj <G Factor System and v\ corresponds to x\}. 

II Let duplicationCount(i,l) = \duplicationSet(i,l)\. 

averageLevel(Z) 

for i = 1 to LevelCounti 

If For each (child) variable x\ € Xi 

if x\ is hidden 

mean Value = dupHcation c unt(i,l) ^J ^duplications 'et(i,l) V k 

for each (j, k) G duplication Set(i, I) 

vi <— meanValue 
end 

x\ *— meanValue 
else if x\ is observed 

for each (j,k) G duplications 'et(i, I) 



end 
end 
end 
end 



? i 

v l <- x i 
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D Learning and inference algorithm for the network in Section 3.1 

We now present a learning and inference algorithm for the network in Figure 4. The model corresponds to 
the variables {x t : t = 1, . . . , T} and {h t , t = 1, . . . , T — 1}. Let us start by partitioning these variables into 
a hidden set Xjj and an observed set Xe- From equation (3.6) we see that the T — 1 vector factorizations 
that describe the network can be compactly represented as a single matrix factorization equation. 

The learning and inference algorithm from Section 2.3 then corresponds to the pseudocode shown in 
Algorithm 7. We include the learning step in the algorithm, although it can be disabled if W. We start by 
initializing the components of X and H corresponding to Xe to the observed values, and initializing the 
components corresponding to Xh to random positive values. We then iterate the bottom-to-top inference 
and learning followed by top-to-bottom value propagation until convergence. The components of X and H 
can be thought of as the model variables. The components of Xc can be thought of as the local variables 
fg corresponding to the model variables in X. Note from Equation (3.2) that each model variable x t except 
X\ and xt corresponds to two local variables in Xc- The components of H can be thought of as serving 
as both model variables and local variables since no model variables h t are duplicated in H. If W is to be 
learned then it is also initialized to random positive values. 

E Learning and inference algorithm for the network in Section 4.1 

We now present a learning and inference algorithm for the network in Figure 31. A learning and inference 
algorithm for this network can be obtained by applying Algorithm 2 to the system of factorizations (4. 5), (4. 7), 
and (4.9). The psuedocode is shown in Algorithm 8. 

Here, the averageParents{< matrix >) procedure is analogous to the averageParentsQ procedure of 
Algorithm 2. In averageParents{< matrix >), the mean value of each set of local variables in < matrix > 
corresponding to the same model variable is computed. All hidden components in < matrix > are then 
updated to the corresponding mean values and all observed components are updated to the corresponding 
values in Xe- 

The averageC'hildren(< matrix! >, < matrixl >) and averageC'hildren(< matrixl >) procedures 
are analogous to the averageChildrenQ procedure of Algorithm 2. Here, jmatrixl^ corresponds to the 
model variables and jmatrix2^ corresponds to the local variables. The mean value of each set of local 
variables in < matrixl > corresponding to the same model variable is computed. All hidden components 
in < matrixl > and < matrixl > are then updated to the corresponding mean values and all observed 
components are updated to the corresponding values in Xe- 

F Learning and inference algorithm for the target tracking net- 
work in Section 5 

We now present a learning and inference algorithm for the target tracking network in Figure 37. A learn- 
ing and inference algorithm for this network can be obtained by applying Algorithm 2 to the system of 
factorizations () - (). The pseudocode is shown in Algorithm 9. Note that variables {x\\ corresponding 
to A 1 are observed or partially observed and all other model variables were hidden in the experimental 
examples. However, the general form of the inference and learning algorithm remains the same regardless of 
any particular choice of variable partitioning into Xh and Xe- 

G Learning and inference algorithm for the network in Section 6 

We now present a learning and inference algorithm for a network with L levels in Section 6. The network 
has L levels, corresponding to layers X£, Xf, . . . ,X^. Any values for the number of children nodes p and 
node separation q are allowed, and can even be distinct for each level. In the example in this paper X 1 
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Algorithm 7 Perform inference and learning for the network in Figure 4 

// Main loop 
repeat 

// Bottom-to-top inference and learning 

upStep(X c ,W,H) 

averageParents (H) 

II Top-to-bottom value propogation 

downStep(X c , W,H) 

averagcChildren(X, Xc) 
until convergence 
// Procedure: upStcp() 
upStep(TU, H) 

if learning is enabled 

Learning update: Using Xq — WH, perform a left NMF update on W, using, e.g. (A. 7) 

end 

Inference update: Using Xc — WH, perform a right NMF update on H, using, e.g. (A. 6) 
end 

// Procedure: averageParentsQ 
averageParents(i^) 

Overwrite all observed components in H with the corresponding observed values from Xe- 
end 

// Procedure: downStepQ 
downstep(A c , W, H) 

X C <~WH 
end 

// Procedure: averageChildrcnQ 
averageChildren(X, Xc) 

II Compute the mean value of each pair of components in Xc that correspond to the same 

// hidden model variable in X. Overwrite all hidden components in X and Xc with 

// the computed mean values. 

(X C ,X) <— meanValues(Xc) 

Overwrite all observed components in X and Xc with the corresponding values in Xe- 
end 
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Algorithm 8 Perform inference and learning for the network in Figure 31 

// Main loop 
repeat 

// Bottom-to-top inference and learning 

upStepp^Ty 1 ,^ 1 ) 

averageParents(iJ 1 ) 

upStep(X^,l^ 2 ,^ 2 ) 

averageParents (H 2 ) 

tt2 "| 

,U,V) 



upStep( 



ff 1 



averageParents (V) 

// Top-to-bottom value propogation 

" H 2 ' 



H 1 



downStep( 
averageChildrcn( 



) 



,U,V) 

H 2 
H 1 
downStep(A 2 ,VF 2 ,ff 2 )~ 
averageChildren(X 2 , X^) 
downStep(X^,W 1 ,jy 1 ) 
averageChildrcn(X 1 , Xq) 
until convergence 



is observed and all other X % are hidden, but any partitioning of the model variables into Xh and Xe is 
allowed. A learning and inference algorithm for this network can be obtained by applying Algorithm 2 to 
the system of factorizations (6.14). The pseudocode is shown in Algorithm 10. 
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Algorithm 9 Perform inference and learning for the network in Figure 31 



// Main loop 
repeat 

// Bottom-to-top inference and learning 



upStep( 

averagel 
upStep( 



,W 1 ,U 1 ) // Factorization (5.11) 



,W 3 ,U 2 ) // Factorization (5.5) 



, W 7 , U 3 ) II Factorization (5.3) 



Y 3 

x% 

X 1 
averageParents ( U 1 ) 

1 Copy 1 
L Copy 2 

averageParents ( U 2 ) 
upStep( y4 

A Copy2 

averageParents ( U 3 ) 

upStepp^TF 2 ,// 1 ) // Factorization (5.7) 

averageParents(iJ 1 ) 

upStep(X^W 6 ,H 3 ) // Factorization (5.1) 

averageParents (H 3 ) 

H 2 

H 1 

averageParents (V 1 ) 

T H 3 
upStep( 2 

L n Copy\ 

averageParents ( V 2 ) 



upStep( 



, W 4 , V 1 ) I) Factorization (5.9) 



, W 5 , V 2 ) II Factorization (5.8) 



// Top-to-bottom value propogation 
H 3 



downStcp( 



tt2 

n Copy2 

averagcChildrcn(i7 3 ) 



,W 5 ,V 2 ) I) Factorization (5.8) 



downStep( 



H 2 
H 1 



,W 4 ,V 1 ) I) Factorization (5.9) 



averageChildren(iJ 1 ) 
averageChildren(# 2 , H% opyl , H% opy2 ) 
downStep(X 4 , W 6 ,H 3 ) // Factorization (5.1) 



downStep( 
averagcChildren {X 5 ) 



X b 

Y 4 
^Copy2 



downStep( 



Y 4 
^Copyl 

Y 3 
yi -Copy2 



,W 7 ,U 3 ) I) Factorization (5.3) 



,W 3 ,U 2 ) If Factorization (5.5) 



averageChildren(X£, X% opyY , X% opy2 ) 
downStep(X^,VF 2 ,iJ 1 ) // Factorization (5.7) 



downStep( 



y3 

^■Copyl 
X* 
X 1 



averageChildren(X^ opyl , ^ Copy2 



, W 1 ,!/ 1 ) I) Factorization (5.11) 
X 3 



averageChildrcn(X, X 2 ,) 
averageChildren(X 1 ) 
until convergence 
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Algorithm 10 Perform inference and learning for an L level network in Section 6 

// Main loop 
repeat 

// Bottom-to-top inference and learning 
for I = 1 to L - 1 

upStep(X\W l ,X l ? +1 ) // Factorization (6.14) 

averageParents(X^ +1 , X l+1 ) 
end 

// Top-to-bottom value propogation 
for I = L — 1 downto 1 

downStep(A\VP,Al, +1 ) // Factorization (6.14) 

averagcChildren(X 4 , X l s ) 
end 
until convergence 
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